Powerful algorithms reduce noise and improve accuracy in increasingly important SNP genotype-calling programs.
Question: How do humans learn to complete a task?
Quick answer: We follow instructions.
And how do computers learn how to complete a task?
The answer’s simple. Computers follow a series of detailed instructions called an algorithm to complete a task.
The field of genomics could never exist without its synergistic relationship with the field of bioinformatics. Likewise, bioinformatics could never exist without the algorithm. Bioinformatics has made it possible to sequence entire genomes and is now making it possible to use that sequence data to identify variation in the human genome. One major form of normal variation in the human genome is the single nucleotide polymorphism, or SNP.
But one must be especially careful when calling a difference in sequence at a single base position a SNP, especially when the technology used to sequence that piece of DNA does not provide sufficient accuracy to identify a SNP. This is the problem with the process of SNP genotyping (otherwise known as SNP genotype calling). And the solution seems to be SNP genotype calling algorithms.
“SNP calling is really based on the algorithm used to identify DNA polymorphisms,” says Steve Lincoln, vice president of informatics, Affymetrix, Santa Clara, Calif. According to Lincoln, there are two classes of technologies used for SNP calling. The first class includes capillary electrophoresis-based DNA sequencing. This technology can be used to sequence diploid DNA and to identify polymorphisms, even though it was not designed to do so. In general, when calling bases, these technologies were designed to yield a homogeneous, single answer; i.e., exactly one base at any given position in a gene.
The second class of technology was actually designed to perform SNP calling; these are the SNP chips. One SNP chip product is Affymetrix’s SNP 6.0. “Our SNP 6.0 array is quite good at reading out polymorphisms when there is a heterogeneous answer at any single spot. Because you have copies of DNA from mom and dad, you’re a heterozygote, so you have two bases at any one position [on homologous chromosomes]. And indeed, [SNP chips] are designed to deal with that,” says Lincoln.
Published literature on the SNP 6.0 demonstrates extremely high-quality calls, meaning that the calls are highly accurate and have high call rates. The call rate is the fraction of assayed SNPs for which the algorithm actually makes a genotype call; call accuracy means that when a SNP is called, it is called correctly. Such high-quality calling is the result of internal work at Affymetrix as well as its work with outside collaborators on the critical algorithms that do the calling. “And this high-quality call rate is necessary, especially when working with real-world data sets,” says Lincoln. “By real-world data sets, I mean data sets that are not perfect, where the DNA quality that you are trying to call SNPs in may come from samples that contain DNA that is either of insufficient quantity or insufficient quality.”
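The two quality measures can be made concrete with a few lines of Python. This is only an illustrative sketch: it assumes the common convention that a "no call" is recorded as `None`, that call rate is the share of assayed SNPs that receive any call, and that accuracy is measured only over the SNPs that were called. The function name `call_metrics` and the sample data are hypothetical.

```python
def call_metrics(calls, truth):
    """Return (call_rate, accuracy) for a list of genotype calls.

    calls: genotype strings such as "AA", or None where no call was made.
    truth: the known genotypes for the same SNPs, in the same order.
    """
    called = [(c, t) for c, t in zip(calls, truth) if c is not None]
    call_rate = len(called) / len(calls)
    # Accuracy is judged only on SNPs that actually received a call.
    accuracy = sum(c == t for c, t in called) / len(called) if called else 0.0
    return call_rate, accuracy


# Hypothetical data: five SNPs, one no-call and one wrong call.
calls = ["AA", "AT", None, "TT", "AT"]
truth = ["AA", "AT", "AT", "TT", "TT"]
rate, acc = call_metrics(calls, truth)
print(rate, acc)
```

Separating the two numbers matters because an algorithm can trade one for the other: refusing to call ambiguous SNPs lowers the call rate but tends to raise accuracy.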
Noise reduction
SNP genotype calling using microarray technology generally works as follows. “When looking across many different DNAs when you are trying to call any single SNP, you get to observe patterns,” says Lincoln. He gives the following example. If the genotype at a given SNP is “AA”, then the signal for every “AA” sample assayed should look essentially the same. The same principle applies to the signals for “AT” and “TT”. This pattern-based SNP genotype calling is carried out by the clustering algorithms that accompany SNP arrays produced by Affymetrix and other manufacturers.
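The clustering idea Lincoln describes can be sketched in Python. This is a toy illustration, not the algorithm shipped with any commercial array: it assumes each sample yields one intensity for the "A" allele and one for the "T" allele, summarizes them as a contrast value between -1 and 1, and groups the samples with a simple one-dimensional k-means. All names, starting centers, and intensities are hypothetical.

```python
def contrast(a, t):
    """Allele contrast: near +1 for AA, near 0 for AT, near -1 for TT."""
    return (a - t) / (a + t)


def nearest(value, centers):
    """Index of the cluster center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values, centers, iterations=20):
    """Plain k-means on scalars; empty clusters keep their old center."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest(v, centers)].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Hypothetical (A intensity, T intensity) pairs for six samples at one SNP.
signals = [(950, 60), (900, 100), (520, 480), (450, 510), (80, 1000), (120, 890)]
values = [contrast(a, t) for a, t in signals]

labels = ("TT", "AT", "AA")  # ordered to match the starting centers below
centers = kmeans_1d(values, centers=[-0.8, 0.0, 0.8])
genotypes = [labels[nearest(v, centers)] for v in values]
print(genotypes)
```

Because the clusters are learned from many samples at once, a sample with a weak or noisy signal is judged against the pattern of its neighbors rather than against a fixed threshold.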
“SNP genotype calling is the process of applying a technology to read that place in the genome and then basically extracting the data and calling it a ‘G’ or calling it an ‘A,’” says Dietrich A. Stephan, PhD, director and senior investigator, Neurogenomics Division, Translational Genomics Research Institute, Phoenix, Ariz.
The goal of Stephan’s research is to be able to use SNPs to paint chromosomes in order to analyze how these chromosomes flow through families and populations. Chromosome painting is a hybridization-based technology in which each chromosome in a haploid genome is differentially labeled by hybridizing to a specific nucleic acid probe tagged with a unique fluorescent marker. One reason for performing chromosome painting with SNPs is to be able to identify abnormalities on a chromosome and determine how they might be linked to human disease. In the case of Stephan’s chromosome painting experiments, the specific nucleic acid sequences are SNPs. “So basically you look for the difference in a base at a specific position in a gene across a population that has a disease versus a population of individuals who don’t,” says Stephan.
Stephan and his colleagues have also developed a SNP genotype-calling algorithm called SNiPer-High Density (SNiPer-HD). Based on an expectation-maximization algorithm, SNiPer-HD was designed to reduce the number of false-positive associations between SNP markers and phenotype that arise from systematically miscalled SNPs. The result is a reduction in the “noise” inherent to the SNP-calling process.
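For readers unfamiliar with expectation-maximization (EM), the following Python sketch shows the general technique on a one-dimensional mixture of Gaussians. It is not SNiPer-HD's implementation, whose details are not given here. It assumes each sample has been reduced to a single contrast value, that each genotype cluster is roughly Gaussian, and that a sample is left uncalled when no cluster reaches a chosen confidence. All names, starting values, and data are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def posteriors(x, weights, means, variances):
    """Probability that x belongs to each cluster under the current model."""
    dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300  # guard against numerical underflow
    return [d / total for d in dens]


def em_fit(xs, means, iterations=50, min_var=1e-4):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(means)
    means, variances, weights = list(means), [0.05] * k, [1.0 / k] * k
    for _ in range(iterations):
        # E-step: soft-assign every sample to every cluster.
        resp = [posteriors(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each cluster from its weighted members.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty cluster unchanged
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(min_var, var)
    return weights, means, variances


def call(x, model, labels=("TT", "AT", "AA"), min_confidence=0.95):
    """Return a genotype, or None (no call) when the model is unsure."""
    post = posteriors(x, *model)
    best = max(range(len(post)), key=lambda j: post[j])
    return labels[best] if post[best] >= min_confidence else None


# Hypothetical contrast values for nine samples at one SNP.
xs = [0.88, 0.80, 0.85, 0.04, -0.06, 0.01, -0.85, -0.76, -0.80]
model = em_fit(xs, means=[-0.8, 0.0, 0.8])
calls = [call(x, model) for x in xs]
print(calls)
```

Unlike a hard clustering, EM yields a probability for each genotype, so a caller can decline to call borderline samples instead of systematically miscalling them.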
A researcher does not need hands-on experience in SNP calling to understand that it needs to be done accurately. One scientist who has never performed SNP calling at the bench is Michael Molla, PhD, Center for Bioinformatics fellow and research associate, Department of Biomedical Engineering, Boston University, Boston, Mass. Despite his lack of wet-lab experience, Molla understands how the genotyping part of SNP calling works, having developed SNP-calling software while working as a consultant for NimbleGen Systems, Inc., Madison, Wis.
“My expertise is in machine-learning, primarily, and I made a machine-learning approach to interpret a SNP array,” says Molla. His rationale for the design is as follows. “SNPs are so rare when you are looking at a random piece of the genome that the amount of noise inherent to gene chips will typically overwhelm that signal.” He attributes much of this noise to the fact that a mismatch probe on a SNP genotyping microarray might generate a higher signal level than its corresponding perfect-match probe, so the probe signals alone often point to the wrong base. Statistical methods designed to deal with this noise often require too much fine-tuning for a quick-and-dirty SNP experiment. To facilitate noise reduction, Molla has developed machine-learning methods that do not require as much fine-tuning as the parametric statistical methods; they have been used by NimbleGen Systems in its SNP-calling system.
What’s the difference?
So how does one know that it is the SNP that is causing a difference in phenotype? Eric Schadt, PhD, executive scientific director of research genetics, Rosetta Inpharmatics, a wholly owned subsidiary of Merck located in Seattle, Wash., has the answer. “You genotype hundreds of thousands of markers that are surrogates of the causal SNP or causal marker or DNA variation that is causing the trait, but you’ll have a marker that is really close to that causal marker,” says Schadt. “You know once you have that association and you know the actual causal variation that the SNP is linking with, that you will have to do lots of pretty hard work.”
According to Schadt and other experts, genetics researchers have hit a wall when it comes to nailing down causal mutations more efficiently. In fact, he says that most of the published papers in Genome and Science do not identify the causal mutation, but only identify closely linked markers that are highly correlated with the causal mutation. One of the primary approaches to determining a causal mutation is called deep re-sequencing, where one “re-sequences the genes in the region that are really high-depth in individuals that are at the extremes of the phenotype you’re looking at,” says Schadt. Deep re-sequencing is a technique by which the DNA sequences suspected of containing a SNP are re-sequenced in order to pick up rare SNP variants hidden in large human populations.
In addition to SNP-based variation in the genome, researchers have, over the last year, found that the copy number of a given gene can vary from person to person much more than was previously thought. As a result, the need to accurately identify copy number variation (CNV) has also increased dramatically. “We need copy number calling,” says Lincoln. “That is, we want to be able to detect whether, at a particular locus in the genome, there is one copy, two copies, three, or more, in that person’s genome at that site.” Multi-probe, hybridization-based microarrays coupled with specific algorithms similar to those used for SNP genotype calling are also used to determine CNV.
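A deliberately simplified Python sketch shows what copy number calling from several probes might look like. It rests on strong assumptions that real arrays do not fully satisfy: that probe intensity scales linearly with copy number, that a reference sample with two copies is available for the same probes, and that the median across probes is enough to suppress noise. The function name and all intensities are hypothetical.

```python
from statistics import median


def estimate_copy_number(sample_probes, reference_probes, reference_copies=2):
    """Estimate integer copy number at one locus from several probes.

    Each sample intensity is divided by the intensity of the same probe in
    a reference assumed to carry `reference_copies` copies; the median ratio
    is scaled and rounded to the nearest whole copy.
    """
    ratios = [s / r for s, r in zip(sample_probes, reference_probes)]
    return round(reference_copies * median(ratios))


# Hypothetical intensities for four probes covering one locus.
reference = [1000, 1020, 1050, 980]    # two-copy reference sample
gain = [1480, 1530, 1610, 1450]        # roughly 1.5x the reference
loss = [510, 490, 530, 470]            # roughly 0.5x the reference

print(estimate_copy_number(gain, reference))
print(estimate_copy_number(loss, reference))
```

Using the median of several probes rather than a single probe mirrors the multi-probe design mentioned above: one aberrant probe does not change the call.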
In summary, no matter what source of genomic variation one studies, highly accurate, algorithm-based genotype calling is necessary to produce reliable DNA sequence data. And it is only on the basis of these data that true genome-based drug discovery can flourish.
This article was published in Drug Discovery & Development magazine: January, 2008, pp. 46-49.