Following completion of the Human Genome Project, most studies of human genetic variation have centered on single nucleotide polymorphisms (SNPs). SNPs are numerous in individual genomes and serve as useful genetic markers in association studies across a population. These markers have been leveraged to identify genetic loci for disease risk and draw associations with numerous traits of interest. Despite their usefulness, SNPs do not tell the whole story. For example, most SNPs are associated with only a small increased risk of disease, and they usually cannot identify on their own which genes are causal. This has resulted in what many researchers have referred to as missing or hidden heritability.
There is growing awareness that another class of genetic variation, structural variants (variants that affect 50 base pairs or more), is responsible for most of the differences in sequence between any two human genomes. Recent studies have described these larger variants as occurring frequently in individuals and across populations, with a larger effect size than SNPs. Structural variants are already known to be causative for numerous conditions ranging from Tay-Sachs disease and ALS to fragile X syndrome and Duchenne muscular dystrophy.
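To make the size threshold concrete, here is a minimal Python sketch that sorts a variant into one of three bins using the 50-base-pair cutoff given above. The function names, the bin labels, and the way affected length is measured (allele length difference for insertions and deletions, allele length otherwise) are illustrative assumptions, not a standard taken from any particular variant caller.

```python
# Threshold from the article: structural variants affect 50 bp or more.
SV_MIN_BP = 50


def affected_length(ref: str, alt: str) -> int:
    """Approximate number of base pairs a variant affects (assumed convention).

    Equal-length alleles are treated as substitutions of that length;
    otherwise the length difference is used, as for insertions and deletions.
    """
    if len(ref) == len(alt):
        return len(ref)
    return abs(len(ref) - len(alt))


def classify_variant(ref: str, alt: str) -> str:
    """Label a variant as 'SNP', 'small variant', or 'structural variant'."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if affected_length(ref, alt) >= SV_MIN_BP:
        return "structural variant"
    return "small variant"


if __name__ == "__main__":
    print(classify_variant("A", "G"))             # single-base change
    print(classify_variant("A", "A" + "T" * 49))  # 49 bp insertion
    print(classify_variant("A", "A" + "T" * 50))  # 50 bp insertion
```

Real pipelines also handle inversions, translocations, and other rearrangements whose size cannot be read off allele lengths, so this is only a sketch of the threshold itself.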
One reason that structural variants have not been studied more thoroughly is that they have remained difficult to detect with previously available technology. Until recently, most DNA sequencing methods were only able to sequence short fragments of DNA, on the order of 150 to 250 base pairs. While these short reads are useful for mapping to exons and identifying small variants, such as SNPs, they are not well suited for spanning larger structural variants. To further compound the problem, structural variants often occur in regions of repetitive or GC-rich DNA sequence, making it even more difficult for short reads to map accurately with sufficient coverage. Due to this ascertainment bias, this class of genetic variation remains understudied and offers fertile ground for new genetic discoveries.
Recent efforts are beginning to apply newer long-read sequencing methods that offer several advantages for discovery and detection of larger variants. With average read lengths that surpass 12,000 base pairs, even very long structural variants can be spanned completely by a single read. Initial studies have focused on assessing the discovery power of long reads in a few carefully curated human genomes. After identifying a ~7-fold increase in discovery power, subsequent studies have focused on exploring structural variation in global ethnic populations to better understand the extent and frequency of this variant class. Future studies will aim to map common structural variants in larger population cohorts, and ultimately identify causative structural variants that better explain heritable disease and result in novel disease gene discovery.
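The read-length argument comes down to simple arithmetic, sketched below in Python. A read can only span an inserted sequence if it covers the whole insertion plus some unique sequence on each side to anchor the alignment. The function name, the 2,000-base-pair insertion, and the 100-base-pair flank requirement are hypothetical values chosen for illustration; the read lengths are the ones quoted in this article.

```python
def can_span(read_length: int, variant_length: int, min_flank: int = 100) -> bool:
    """Return True if one read can cover the variant plus a flank on each side.

    min_flank is an assumed anchoring requirement, not a published value.
    """
    return read_length >= variant_length + 2 * min_flank


if __name__ == "__main__":
    insertion_bp = 2_000  # hypothetical insertion size
    for label, read_bp in [("short read", 250), ("long read", 12_000)]:
        verdict = "spans" if can_span(read_bp, insertion_bp) else "cannot span"
        print(f"{label} ({read_bp} bp) {verdict} a {insertion_bp} bp insertion")
```

Under these assumptions a 250-base-pair read falls far short, while a 12,000-base-pair read covers the insertion with room to spare. Actual detection also depends on coverage, error rate, and how repetitive the surrounding sequence is.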
Following on the success of databases such as dbGaP and gnomAD, scientists recently announced plans to build a structural variation database by sequencing 1,000 Chinese genomes with single-molecule, real-time (SMRT) long-read sequencing technology. The database is expected to cover many disease types and to shed light on both common and rare structural variants that occur in the Chinese population. Once completed, this database will benefit future precision medicine studies by serving as a reference resource for better understanding the frequency and impact of structural variants.
This database project follows a recent high-quality de novo genome assembly of a Chinese individual based on SMRT sequencing. The assembly revealed numerous structural variants, including approximately 20,000 insertions and deletions. Of the variants affecting an exon, around 50 are specific to the Chinese genome assembly. This offers tantalizing evidence that population-specific structural variants will be essential to understanding medically relevant variation among global ethnic populations.
Similar results were seen in a project that applied SMRT sequencing to assemble a Korean genome, resulting in fully phased chromosomes representing 90 percent of the genome. The team identified and successfully phased a clinically relevant duplication of CYP2D6, the gene responsible for metabolizing approximately 25 percent of currently FDA-approved drugs. Improving population-specific reference alleles and genetic testing methods for this important gene will lead to improved pharmacogenomics studies across ethnically diverse global populations.
Genome sequencing has resulted in tremendous progress in diagnosis of Mendelian disorders. Over 50 percent of Mendelian diseases now have a known disease gene. However, there is still a troublingly low diagnosis rate for individuals with Mendelian diseases, which disproportionately impact children. Approximately 60 percent to 75 percent of these individuals are left undiagnosed even after comprehensive genetic analysis using whole-exome and whole-genome short-read sequencing methods. One hypothesis to explain the undiagnosed cases is that causative structural variants are being overlooked. Long-read sequencing offers much higher sensitivity for these larger rare variant types, and offers a path forward for the field.
A recently published study from a group led by Euan Ashley of Stanford University explored the use of SMRT sequencing in Mendelian disease. The team focused on an individual who remained undiagnosed after 20 years of attempts. The patient had endured countless hospital stays and a series of benign tumors, often in his heart. The symptoms were consistent with a vanishingly rare disease called Carney complex, but neither targeted genetic testing nor short-read whole-genome sequencing was able to detect a causative mutation.
Ashley and his collaborators enlisted long-read SMRT sequencing, generating eight-fold coverage of the patient’s genome. They discovered a 2.2 kb deletion in PRKAR1A, the gene associated with Carney complex, and finally delivered a diagnosis to the patient.
In a recent review article titled “Towards Precision Medicine,” Ashley wrote that pathogenic structural variants are critical to the clinical utility of genomics, but noted that they are challenging to detect with technologies other than long-read sequencing due to their length and repetitive content.
Many studies using long-read sequencing methods have now reported upwards of 20,000 structural variants occurring in a human genome, across an ethnically diverse set of individuals. These findings indicate that this class of genetic variation has been systematically understudied and under-reported in previous population genetics projects, and remains under-represented in our current databases.
Now that these genetic elements are accessible with long-read sequencing, there is great opportunity to link structural variant genotypes with disease phenotypes, giving drug discovery scientists a potential wealth of new targets and a better understanding of disease causation. This information could help drive real improvements in characterizing human health and treating or preventing disease as we move toward a new era of precision medicine in global populations.
Luke Hickey is a senior director of human biomedical sciences, and Aaron Wenger is a senior staff scientist at PacBio.