Human Codon Prediction

This project was supervised by Prof. Rachel Kolodny and will soon (hopefully) be published in a peer-reviewed journal. Tentative paper title: "Using Transformers to Study Patterns in Human Endogenous and Viral Codon Use".

Keywords:

Transformers

Codon Prediction

Human Proteins

Human Viruses

Graphical Abstract

Abstract

The extent to which evolutionarily selected synonymous codons are optimized, as well as the overlapping signals underlying their selection, remains only partially understood. We investigate codon usage patterns in human genes, their variation across tissues, and the extent to which human-adapted viruses reflect these patterns.

To this end, we trained an encoder-decoder large language model (LLM) to predict codons from amino acid sequences, conditioned on transcript expression levels across 54 human tissues. We then analyzed the model's performance on human and viral sequences. The model's accuracy exceeded that of frequency-based baselines by an unexpectedly large margin.
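For readers unfamiliar with frequency-based baselines, the sketch below shows one simple form such a baseline can take. It assumes the baseline predicts, for every amino acid, the synonymous codon seen most often in a training set of coding sequences. The baselines actually used in the study may differ, and all function names here (`split_codons`, `fit_most_frequent`, `baseline_accuracy`) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def fit_most_frequent(train_cds):
    """Map each amino acid to its most frequent codon in the training sequences."""
    counts = defaultdict(Counter)
    for cds in train_cds:
        for codon in split_codons(cds):
            if codon in CODON_TO_AA:  # skip codons with ambiguous bases
                counts[CODON_TO_AA[codon]][codon] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def baseline_accuracy(table, test_cds):
    """Fraction of test codons matched by the per-amino-acid prediction."""
    correct = total = 0
    for cds in test_cds:
        for codon in split_codons(cds):
            aa = CODON_TO_AA.get(codon)
            if aa in table:
                correct += table[aa] == codon
                total += 1
    return correct / total if total else 0.0


# Toy data for illustration only (not real genes).
table = fit_most_frequent(["ATGCTGCTGCTAAAATAA"])
accuracy = baseline_accuracy(table, ["ATGCTGCTAAAG"])
```

A baseline like this ignores sequence context and expression level entirely, which is why it serves as a useful lower bar for a context-aware model.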

While highly expressed genes can generally be predicted accurately by naïve models, lower-expression groups exhibit distinct patterns that only the LLM predicted well. Notably, the improvement in prediction accuracy exceeds that gained by accounting for supply-to-demand adaptation (SDA), suggesting the influence of additional factors such as co-translational folding. This is further supported by the model's larger accuracy advantage for longer proteins.

Additionally, our model’s performance on human viruses improves significantly when the expression input is aligned with viral tissue tropism, indicating that viruses may adapt to human codon usage in ways that extend beyond simple frequency matching.

In summary, this study leverages recent advances in language modeling to uncover a surprising degree of complexity in human codon usage, highlight potential underlying signals, improve our estimate of the limit on codon optimization, and explore viral adaptation to these patterns. We hope it also provides a tool for tissue-specific codon optimization or de-optimization. Further research is needed to determine the specific contributions of these selection forces and refine our understanding of their biological significance.

Results

Interactive plot of Human Model Accuracy vs. KL Divergence (from human codon bias) of each viral species:
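As a reading aid for the plot's KL divergence axis, the sketch below shows one way to compute the divergence of a viral codon distribution from a human one. It assumes the quantity is D_KL(viral || human) over all 64 codons, measured in bits, with a pseudocount to avoid zero frequencies. The direction, the smoothing, and the function names are assumptions, not details taken from the study.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_distribution(cds_list, pseudocount=1.0):
    """Smoothed relative frequency of each of the 64 codons in the given sequences."""
    counts = Counter()
    for cds in cds_list:
        usable = len(cds) - len(cds) % 3
        for i in range(0, usable, 3):
            counts[cds[i:i + 3]] += 1
    # Codons containing ambiguous bases are ignored because only CODONS is read.
    total = sum(counts[c] + pseudocount for c in CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same codons."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in CODONS)


# Toy sequences for illustration only (not real genomes).
human = codon_distribution(["ATGCTGCTGGAGAAGTAA", "ATGGCCCTGAAGTGA"])
virus = codon_distribution(["ATGTTATTAGAAAAATAA"])
divergence = kl_divergence(virus, human)
```

KL divergence is asymmetric, so swapping the two arguments generally gives a different value.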

Links

Link to the GitHub Repository
