Fundamentals of codon optimization unlock the secrets behind gene sequence design and protein expression.
Protein gel analyzing expression levels of six codon-optimized scFv variants (in triplicate) and one vector-only negative control. (Source: DNA2.0)
The expression of recombinant proteins in heterologous hosts (e.g., expression of a human protein in Escherichia coli) is a cornerstone of modern biotechnology, addressing vast applications in protein production. Unfortunately, many researchers face obstacles related not only to protein expression in heterologous hosts, but also to overexpression in native hosts. While considerable energy has been spent uncovering modifications that improve protein expression, no reliable and robust process has yet been developed.
Despite significant advances, the underlying problem of optimizing codons for heterologous gene expression, or sequence optimization, remains the same: evolution does not select for protein overexpression in living systems. In living systems, ATP is the biological currency of the cell and evolution selects for the conservation of resources, not gluttonous production of metabolic byproducts. As a result, natural gene sequences differ significantly from gene sequences that overexpress when cloned behind a T7 promoter in E. coli growing in an Erlenmeyer flask filled with rich media (the preferred biological niche for molecular biologists).
In contrast to time-consuming cloning techniques or error-prone PCR amplification to obtain genes of interest for heterologous protein expression, direct gene synthesis offers researchers the opportunity to build any gene without the complications that surround extracting genes from biological materials. Gene synthesis has the added advantage of allowing a complete redesign of the nucleotide sequence, while retaining the exact amino acid sequence thanks to the degeneracy of the genetic code. Even though redesigning the gene sequence through codon optimization often drastically improves heterologous protein expression levels,1 there are many competing algorithms and theories on exactly how synthetic genes should be designed and which variables influence protein expression.2 There has never been a study that systematically examines the underlying variables that define protein expression levels, particularly in heterologous hosts, until now.
A combination of synthetic biology and design of experiment (DoE) was used to design and build systematically varied datasets that are independent of the natural variable correlations that evolutionary history introduces for reasons unrelated to protein expression levels. Two gene datasets (phi29 DNA polymerase and scFv antibody fragment) were designed, synthesized, and expressed in a T7 E. coli expression system. Each gene set consists of 25 individually designed and synthesized variants. The relative protein expression levels were measured and machine-learning algorithms were used to identify and quantify the relevant variables. The protein expression levels from the synthetic genes with systematic silent mutations varied from below the level of detection to 30% of cell mass. Variables controlling expression yield were identified and validated. All gene design was performed using DNA2.0’s Gene Designer software.3
The seven global design variables
Many different variables have been proposed to directly affect codon optimization (reviewed in references 1,3). The variables predominantly appearing in the literature are identified below, and were systematically varied for this study.
Codon Bias Table
Codon bias describes the relative frequency with which synonymous codons are used to encode an amino acid. Codon biases are typically calculated by counting the occurrence of each codon in a set of protein-coding genes. Two separate codon bias tables were used, one calculated from all of the genes in the E. coli genome (E. coli Table 1)4 and a second from a subset of 27 highly-expressed genes (E. coli Table 2).5 Although no causative link has been demonstrated between the Codon Bias Table and high protein expression to date, the codon bias observed in the E. coli Table 2 subset of genes is commonly considered “optimal” for high expression. The Gene Designer software algorithm lets users select the codon bias table they wish to use, and then builds sequences that approximate the bias of that table.
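The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by Gene Designer: the function name `codon_bias_table` is hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_bias_table(coding_sequences):
    """Return {amino_acid: {codon: relative frequency among its synonyms}}."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        # Walk the sequence codon by codon, ignoring any trailing partial codon.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in GENETIC_CODE:
                counts[codon] += 1

    # Total observations per amino acid (stop codons are grouped under "*").
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[GENETIC_CODE[codon]] += n

    table = defaultdict(dict)
    for codon, aa in GENETIC_CODE.items():
        if totals[aa]:  # skip amino acids never seen in the input
            table[aa][codon] = counts[codon] / totals[aa]
    return dict(table)
```

Calling `codon_bias_table` on all genes of a genome versus on a small subset of highly expressed genes would give two tables analogous to the two described above.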
It has been shown that use of rare codons can lower protein expression, particularly when used frequently or in tandem.6 To determine the threshold of rare codon occurrence that impacts protein expression levels in our model systems, we used the Gene Designer software to vary the occurrence of rare codons and test whether inclusion of such codons had a detrimental effect on protein expression. At the low threshold settings, only the rarest codons were excluded, while at the high settings, a significant number of codons were excluded (exact number depends on codon table used). Even when the poor codons were allowed, they appeared rarely and only as defined by the chosen codon bias table.
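One way to picture the threshold setting is as a minimum frequency below which a codon is excluded from the design. The sketch below assumes that interpretation; the function name and the arginine frequencies are made up for illustration and are not taken from either E. coli table.

```python
def apply_codon_threshold(synonymous_freqs, threshold):
    """Drop synonymous codons whose frequency is below `threshold`,
    then renormalize the remaining frequencies so they sum to 1."""
    kept = {c: f for c, f in synonymous_freqs.items() if f >= threshold}
    if not kept:
        # Never leave an amino acid without a codon: keep the most frequent one.
        best = max(synonymous_freqs, key=synonymous_freqs.get)
        kept = {best: synonymous_freqs[best]}
    total = sum(kept.values())
    return {c: f / total for c, f in kept.items()}


# Hypothetical frequencies for the six arginine codons (illustrative only).
arginine = {"CGT": 0.40, "CGC": 0.40, "CGA": 0.05,
            "CGG": 0.05, "AGA": 0.05, "AGG": 0.05}

low_setting = apply_codon_threshold(arginine, 0.04)   # all six codons allowed
high_setting = apply_codon_threshold(arginine, 0.10)  # only CGT and CGC allowed
```

Under the high setting the four rare codons are excluded outright; under the low setting they remain available but are still used only as often as the table dictates.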
Internal RNA Structure
Previous chemical footprinting of mRNA-ribosome complexes demonstrated that up to 20 codons (60 nucleotides) are protected by a single translating ribosome.7 Translation proceeds at approximately 18 codons/second with one ribosome initiating translation approximately every two seconds.8 As a result, only approximately 50 nucleotides are available for folding into an mRNA secondary structure between each translating ribosome. To identify the role that mRNA secondary structures play in protein expression, the potential mRNA secondary structure of each variant was determined using UNAFold.9 UNAFold is a software package for nucleic acid folding and hybridization prediction. Upon examination, the 50-nucleotide window from base-pair position +147 to +196 was found to be especially prone to forming strong mRNA hairpin structures. To test the influence of this particular structure on protein expression, we manually modified the sequence in selected gene variants to encode strong, medium, or weak RNA secondary structures. The introduced changes were made through minimal codon substitutions within the bias and threshold limits.
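The estimate of roughly 50 free nucleotides follows from the figures quoted above, as this short calculation shows. The helper names are hypothetical, and the actual folding step (performed with UNAFold in the study) is not reproduced here; the window helper only extracts the sequence that would be passed to a folding program.

```python
NT_PER_CODON = 3


def free_nucleotides_between_ribosomes(codons_per_second=18,
                                       seconds_between_initiations=2,
                                       protected_codons=20):
    """Nucleotides left unprotected between two consecutive ribosomes."""
    # Distance the leading ribosome travels before the next one initiates.
    spacing_nt = codons_per_second * seconds_between_initiations * NT_PER_CODON
    # Stretch of mRNA covered by one ribosome.
    footprint_nt = protected_codons * NT_PER_CODON
    return spacing_nt - footprint_nt


def window(sequence, start, end):
    """Return the subsequence from `start` to `end`, 1-based and inclusive."""
    return sequence[start - 1:end]


free_nt = free_nucleotides_between_ribosomes()  # 18 * 2 * 3 - 20 * 3 = 48
```

The result, 48 nucleotides, matches the "approximately 50" quoted above, and `window(seq, 147, 196)` returns exactly 50 nucleotides.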
5’ mRNA Structure and Coding
The first 10 to 15 codons immediately downstream of the 5’ initiation ATG codon have been shown to influence gene expression levels through several means, including mRNA secondary structure or wobble position nucleotide.10,11 To determine the influence of this codon region on protein expression, we again used UNAFold to calculate the mRNA secondary structure for the first 121 nucleotides from the 5’ end (including the ribosome-binding site). This structure was manually edited in selected gene variants to encode strong, medium, or weak RNA secondary structures. The introduced changes were made through minimal codon substitutions within the bias and threshold limits. We also manually modified the percentage of third position A or T in the first 15 codons.
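The wobble-position variable is simple to compute. The following sketch assumes the coding sequence starts with the ATG initiation codon and that the "first 15 codons" are those immediately downstream of it; both the function name and that reading are assumptions.

```python
def third_position_at_percent(cds, n_codons=15, skip_start_codon=True):
    """Percentage of codons with A or T in the third (wobble) position,
    measured over the first `n_codons` codons of a coding sequence."""
    cds = cds.upper()
    offset = 3 if skip_start_codon else 0  # optionally skip the initial ATG
    codons = [cds[i:i + 3] for i in range(offset, offset + 3 * n_codons, 3)]
    codons = [c for c in codons if len(c) == 3]  # drop partial codons
    if not codons:
        raise ValueError("sequence too short to contain a complete codon")
    at_count = sum(1 for c in codons if c[2] in "AT")
    return 100.0 * at_count / len(codons)
```

A variant with five A/T-ending codons among its first 15 would score 33.3%, while one built entirely from A/T-ending codons would score 100%.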
Homopolymeric nucleotide runs, including runs of G/C, have been suggested, and in some cases shown, to introduce ribosome and RNA polymerase slippage, resulting in translational frameshifts and premature truncation of the encoded polypeptide. To fully assess the impact that G/C runs have on protein expression levels, the Gene Designer algorithm identified runs of six or more G/C nucleotides and introduced the number of G/C runs as a variable.
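Counting such runs can be done with a regular expression. The sketch below is not Gene Designer's implementation. By default it assumes a "G/C run" means six or more consecutive nucleotides that are each G or C, in any mix; a flag restricts the count to strictly homopolymeric runs in case that narrower reading is intended.

```python
import re


def count_gc_runs(sequence, min_length=6, homopolymer_only=False):
    """Count non-overlapping runs of at least `min_length` G/C nucleotides."""
    if homopolymer_only:
        # Runs made of a single repeated base, e.g. GGGGGG or CCCCCC.
        pattern = r"G{%d,}|C{%d,}" % (min_length, min_length)
    else:
        # Runs made of any mixture of G and C, e.g. GGCCGC.
        pattern = r"[GC]{%d,}" % min_length
    return len(re.findall(pattern, sequence.upper()))
```

For example, the sequence `ATGGGCCCGATTGCGCGCGCAT` contains two mixed G/C runs (`GGGCCCG` and `GCGCGCGC`) but no homopolymeric run of six or more.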
Percent Identity to Wild Type
In several internal and collaborative codon optimization experiments, synthetic genes were made using gene sequences designed to be as genetically distant as possible from the wild type sequence. These genes typically express proteins at high levels. Accordingly, the percent identity to the wild type gene was introduced as a gene design variable. This design variable was implemented using the “as far away as possible” feature in Gene Designer.
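Because every variant encodes the same amino acid sequence, the variant and wild type genes have the same length and can be compared position by position without alignment. A minimal sketch of that comparison, with a hypothetical function name:

```python
def percent_identity(variant, wild_type):
    """Percentage of nucleotide positions at which two equal-length
    coding sequences carry the same base."""
    if len(variant) != len(wild_type):
        raise ValueError("sequences must have the same length")
    if not wild_type:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(variant.upper(), wild_type.upper()))
    return 100.0 * matches / len(wild_type)
```

For instance, `ATGCTGAAA` and `ATGTTAAAG` both encode Met-Leu-Lys but differ at three of nine positions, giving roughly 66.7% identity.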
Codon Adaptation Index (CAI)
Researchers’ interest in characterizing sequence variability in naturally derived genomes has served as the basis for identifying gene design variables that may increase protein expression. The assumption is that gene sequence characteristics present in naturally highly-expressed genes correspond to the characteristics needed to express heterologous genes. Variables such as the popular codon adaptation index (CAI), which is a measure of abundance of the most common codon for a given amino acid, have been shown to correlate with expression level in genomic sequences.5 However, correlation does not imply causation. In an often-stated circular argument, it is assumed that because CAI correlates with elevated gene expression in natural genomic datasets, synthetic genes designed to encode high CAI are highly expressed. However, the correlation of variables such as CAI in a subset of genes is more likely to reflect evolutionary constraints involved in facilitating DNA replication, mutational bias, intrinsic metabolic regulation, transposon resistance, or ancestral origin than to correspond to high expression of a recombinant protein when grown in rich media.12 It should also be noted that almost all of the highly expressed genes in the Sharp (1987) Class II dataset encode ribosomal proteins, a group of genes that differ significantly from non-ribosomal genes in several characteristics. To determine how strong the correlation is between CAI and increased protein expression, the CAI score was monitored in the dataset for these experiments and included in the modeling.
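For readers unfamiliar with the score, CAI in its widely used formulation (attributed to Sharp and Li) is the geometric mean of each codon's "relative adaptiveness": its count in a reference gene set divided by the count of the most-used synonymous codon. The sketch below follows that formulation under stated assumptions: the standard genetic code, single-codon amino acids (Met, Trp) and stop codons excluded, and zero counts replaced by 0.5 so that the logarithm stays defined. The function names are hypothetical.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def relative_adaptiveness(reference_sequences):
    """Weight of each codon: its count divided by the count of the
    most-used synonymous codon in the reference set."""
    counts = Counter()
    for seq in reference_sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1

    weights = {}
    for aa in set(GENETIC_CODE.values()) - {"*"}:
        synonyms = [c for c, a in GENETIC_CODE.items() if a == aa]
        if len(synonyms) == 1:
            continue  # Met and Trp carry no synonymous choice
        # Assumption: unseen codons get a pseudo-count of 0.5.
        adjusted = {c: counts[c] or 0.5 for c in synonyms}
        top = max(adjusted.values())
        for c in synonyms:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scored codons in `cds`."""
    cds = cds.upper()
    logs = [math.log(weights[cds[i:i + 3]])
            for i in range(0, len(cds) - 2, 3)
            if cds[i:i + 3] in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

A gene built only from the most-used codon of each amino acid scores 1.0, which is why CAI rewards exactly the "common codons are good codons" design strategy.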
Heterologous protein expression
Past research has examined the expression of functional proteins in heterologous hosts,1 but never in terms of how the variables previously described independently or collectively influence overall expression levels. The application of synthetic genes to define the interplay of these variables provides an opportunity to identify the factors most critical to high protein yield, and builds this information into algorithms dedicated to the complex demands of protein engineering.
Typical increases in expression for codon-optimized mammalian proteins in E. coli are between five-fold and 15-fold.1 But in this systematic study, heterologous protein expression varied from non-detectable to approximately 30% of cell mass, based solely on codon optimization. These changes constitute a difference of two orders of magnitude in expression levels. Contrary to common belief, we show that most of the variables identified above do not correlate with increased heterologous protein expression.
Choice of codon table does not significantly affect expression levels, whereas higher codon threshold level and high CAI are negatively correlated with protein expression levels for both phi29 and scFv expression. Very high CAI may even be detrimental to protein expression. This result may come as a surprise for many, since the prevailing notion is that common codons are good codons. RNA secondary structure, either in the 5’ end or internal to the coding region, as well as A/T wobble, show some correlation to expression levels. However, the correlation is not consistent between the two datasets and could reflect gene specific elements or could be indirect effects of other variables such as global codon bias. “Runs of G/C” is a factor somewhat negatively correlated to protein expression and “distance from wild type” is a factor that is positively correlated to protein expression levels. This could be a direct effect or it could more likely be attributed to an indirect effect of altered global codon bias as discussed below.
Global codon bias
Interestingly, while numerous variables were tested for both the phi29 DNA polymerase and scFv antibody fragment, the genetic distance between a variant and the wild type sequence was, by far, the most dominant factor influencing expressed protein levels for phi29 (scFv is a non-natural gene, so no wild type sequence exists). The avoidance of G/C runs was a distant second in importance, but it was present in both sets. The two phi29 variants that differed most from the wild type sequence showed more than two-fold higher expression levels than the next best variant that represented the other variables tested. Both phi29 variants not only had significant sequence differences from the rest of the variants, but also had a very different codon bias compared to the other variants. Three possible hypotheses account for this trend:
Elimination of cis-regulatory Elements
One possible explanation for these results is that the wild type gene contains cis-regulatory elements that attenuate expression, and that these elements are eliminated only when the sequence is forced to be dissimilar from wild type. While this may be a plausible explanation, in the case of this study it is highly unlikely, as there were no contiguous strings of nucleotides longer than six conserved among variants 1-18. With so little sequence conserved among the variants, there is very little chance that they share common sequence elements that would attenuate expression.
A second possibility is that there are specific local, rate-limiting codons or codon combinations that are only removed when codon choice is forced away from wild type.
Improved Global Codon Bias
The third explanation is that forcing the sequence away from wild type creates an improved global codon bias for one or more amino acids.
Testing the hypotheses
The three hypotheses were tested by constructing hybrids between the low- and high-expression variants. The phi29 and scFv genes were split into three segments and all combinations of high- and low-expressing variants were synthesized and analyzed via activity measurement. In theory, if the first or second hypothesis were accurate, one would expect to see localized effects of the introduced hybrid sequences. Instead, in both cases (i.e., phi29 and scFv) the results strongly correlated with an additive effect of the introduced segments, suggesting that improved global codon bias is, in fact, the variable that governs heterologous protein expression.
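With three segments and two parent variants, the full hybrid set contains 2 × 2 × 2 = 8 constructs, including the two parents. The following sketch enumerates them; it assumes the genes are cut into roughly equal thirds on codon boundaries, which the study does not specify, and the function names are hypothetical.

```python
from itertools import product


def split_into_segments(cds, n_segments=3):
    """Split a coding sequence into segments that end on codon boundaries."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    n_codons = len(cds) // 3
    # Segment boundaries expressed in nucleotides.
    bounds = [(k * n_codons) // n_segments * 3 for k in range(n_segments + 1)]
    return [cds[bounds[k]:bounds[k + 1]] for k in range(n_segments)]


def build_hybrids(high, low, n_segments=3):
    """Return every combination of high (H) and low (L) expressing segments,
    keyed by a pattern string such as 'HLH'."""
    if len(high) != len(low):
        raise ValueError("parent sequences must have the same length")
    high_segments = split_into_segments(high, n_segments)
    low_segments = split_into_segments(low, n_segments)
    hybrids = {}
    for pattern in product("HL", repeat=n_segments):
        hybrids["".join(pattern)] = "".join(
            high_segments[k] if parent == "H" else low_segments[k]
            for k, parent in enumerate(pattern)
        )
    return hybrids
```

If expression depended on a single local element, only the hybrids carrying that one segment from the high-expressing parent would express well; an additive pattern across `HLL`, `LHL`, and `LLH` instead points to a global effect.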
The results further indicated that not all codons are created equal, and some codons confer a much stronger effect on protein expression than others. In other words, while most codon optimization algorithms address global codon alterations, in reality, the need for optimization is much more localized to a handful of specific codons. To explore the effects of individual codons on protein expression levels, systematically varied sequences were reanalyzed with each individual codon applied as an independent variable. With this approach, statistically relevant information was used to model both data sets with a high degree of correlation to protein expression levels.
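Treating each codon as an independent variable amounts to describing every gene variant by a vector of 64 codon counts, which can then be fed to whatever regression or machine-learning model is preferred. The sketch below shows only that feature-building step; the names are hypothetical and the modeling method used in the study is not reproduced.

```python
from itertools import product

# Fixed ordering of all 64 codons so that every variant maps to the same columns.
CODONS = ["".join(p) for p in product("TCAG", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_count_vector(cds):
    """Return a 64-element list with the number of occurrences of each codon."""
    cds = cds.upper()
    vector = [0] * len(CODONS)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in CODON_INDEX:
            vector[CODON_INDEX[codon]] += 1
    return vector


def feature_matrix(variants):
    """One row per gene variant, one column per codon."""
    return [codon_count_vector(cds) for cds in variants]
```

Pairing each row with the measured expression level of that variant gives the kind of table from which per-codon effects can be estimated.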
While researchers have generally accepted that codon optimization often results in higher protein expression in heterologous hosts, until now, there has been no systematic in-depth research to determine which variables afford the greatest impact on overexpression, or how. Advances in molecular biology have handed scientists the tools to explore and characterize the genome of virtually any organism. Synthetic gene design takes this opportunity one step further, providing a means for novel genetic engineering to exploit the benefits of gene and protein overexpression.
Throughout history, evolution has functioned to conserve resources. This is exemplified by gene sequence and structure, and the production of only the minimum amount of protein required for metabolic functions. While some codon optimization algorithms are based on evolutionary trends that assume that sequence homology ensures expression in heterologous hosts, the research presented here demonstrates that many of the common assumptions regarding codon optimization algorithms may be flawed. In the case of CAI, codon threshold, and RNA secondary structures, it is a classic situation where correlation does not imply causation. Just because the presence of certain sequence features correlates with genomic protein expression, it does not follow that these features control protein expression levels.
Once due diligence is paid to identify the variables that matter most to codon optimization and protein overexpression, it becomes clear that global codon bias is the variable that has the strongest effect on the level of heterologous protein expression. The optimal codon bias is neither that of the average E. coli genomic codon bias nor that of the highly expressed Class II genes, but instead a discrete area separated from the standard codon bias regions. As new conclusions are derived from our ongoing National Science Foundation-funded codon optimization work, the underlying algorithms will be incorporated into new versions of the freely-available Gene Designer tool.
About the Author
Gustafsson’s professional background includes positions at Maxygen Inc. and Kosan Biosciences, and research, teaching, and post-doctoral positions at University of California, Santa Cruz; University of California, San Francisco; and University of Umeå (Sweden), where he received a BSc (Microbiology) and a PhD (Molecular Biology/Biochemistry).