Meet BigRNA, an AI foundation model that can predict gene expression from genomic sequences

By Brian Buntz | August 2, 2024

BigRNA is said to be the first transformer neural network for the discovery of RNA biology and therapeutics.

[Image: Deep Genomics]

Two years ago, Deep Genomics was juggling about 40 different machine learning models to decode RNA biology. Now, they’ve revamped their approach with a single, powerful AI system known as BigRNA. This foundation model is capable of predicting how genes behave in different tissues, how they’re spliced, and how they interact with other cellular components — all from reading DNA sequences.

In training, BigRNA learned from thousands of genome-matched datasets, enabling it to predict tissue-specific RNA expression, splicing patterns, microRNA sites, and RNA binding protein specificity — all from raw DNA sequences.
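To make the idea concrete, here is a minimal sketch in PyTorch of how a sequence-to-function foundation model of this general type can be organized: a transformer encoder reads one-hot-encoded DNA, and shared representations feed several task-specific prediction heads. Everything below (layer sizes, head names, the toy input) is an illustrative assumption, not Deep Genomics’ actual BigRNA architecture.

```python
# A minimal sketch of a generic transformer-over-DNA model with multi-task
# heads. Illustrative assumptions throughout; NOT the actual BigRNA design.
import torch
import torch.nn as nn

class RnaFoundationSketch(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_tissues=50):
        super().__init__()
        self.embed = nn.Linear(4, d_model)  # 4 channels: one-hot A/C/G/T
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Task-specific heads sharing one backbone: the "one model" idea
        self.expression_head = nn.Linear(d_model, n_tissues)  # per-tissue level
        self.splice_head = nn.Linear(d_model, 3)  # donor / acceptor / neither

    def forward(self, x):  # x: (batch, seq_len, 4) one-hot DNA
        h = self.encoder(self.embed(x))
        return {
            "expression": self.expression_head(h.mean(dim=1)),  # per sequence
            "splicing": self.splice_head(h),  # logits per position
        }

model = RnaFoundationSketch()
dna = torch.randint(0, 4, (1, 1000))  # toy 1 kb sequence as base indices
out = model(nn.functional.one_hot(dna, num_classes=4).float())
print(out["expression"].shape, out["splicing"].shape)  # (1, 50), (1, 1000, 3)
```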

Promising early results

In early tests, BigRNA showed strong capabilities. “We trained BigRNA using RNA-Seq datasets and genomic sequences,” said Brendan Frey, Ph.D., founder and chief innovation officer of Deep Genomics. “BigRNA had never trained on oligonucleotide therapeutics.” Even so, the model correctly predicted the effects of steric blocking oligonucleotides (SBOs) on increasing the expression of all four genes tested, and it correctly predicted splicing outcomes for 18 out of 18 exons across 14 genes, including those linked to Wilson disease and spinal muscular atrophy, as the bioRxiv paper noted.
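The paper’s exact scoring procedure isn’t described here, but one generic way to estimate an SBO’s effect in silico is to compare a model’s expression prediction for the unmodified sequence against a version in which the oligo-bound site is masked, mimicking the regulatory element the SBO would occlude. The sketch below illustrates that heuristic with a hypothetical stand-in for a trained model; the function names and the repressor motif are invented for illustration.

```python
# A hedged sketch of one way to score a steric blocking oligonucleotide (SBO)
# in silico. An illustrative heuristic, not the procedure from the paper.

def sbo_effect_score(predict_expression, seq, oligo_start, oligo_len):
    """predict_expression: any callable mapping a DNA string to a float."""
    baseline = predict_expression(seq)
    # Replace the SBO target site with 'N's to mimic the blocked element
    masked = seq[:oligo_start] + "N" * oligo_len + seq[oligo_start + oligo_len:]
    return predict_expression(masked) - baseline  # > 0: blocking raises expression

# Toy stand-in for a trained model: predicted expression drops with each copy
# of a fictitious repressor motif, so blocking that motif should raise it.
def toy_model(seq):
    return 1.0 - 0.1 * seq.count("TTAGG")

seq = "ATGC" * 20 + "TTAGG" + "ATGC" * 20  # motif sits at position 80
print(sbo_effect_score(toy_model, seq, oligo_start=80, oligo_len=5))  # 0.1
```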

Brendan Frey, Ph.D.

BigRNA’s unexpected proficiency in designing oligonucleotide therapeutics highlights a notable advantage of foundation models: their ability to develop capabilities beyond their initial training. “The big breakthrough has been foundation models,” Frey said. “We can now build one model that can do it all, and even do things we hadn’t imagined.”

“It’s just one model with one data set. And so it’s scalable,” Frey said. “You can increase the size of the neural network and get a better model.”

Previously, each of Deep Genomics’ models was trained separately, and thus did not account for interdependencies. “The most obvious thing is there’s just no way to scale up 40 different machine learning models,” Frey said. “Just imagine that each of those models had its own carefully constructed data set. You know, the researchers had to spend a whole bunch of time optimizing the training parameters, then doing validation and testing 40 different times.”

The company has also developed a platform known as the “AI Workbench,” which can analyze genomic data to find therapeutic targets and design RNA-based drug candidates for genetically defined diseases.

The expanding universe of BigRNA

The size and scale of BigRNA are constantly evolving. “The model released a year ago had about a billion parameters, similar to GPT-2,” Frey said. “Our current internal version has about 100 billion parameters. We expect BigRNA to have upwards of a trillion parameters by the end of the year.”

This exponential growth in size is driven by the gigantic amount of data required to train the model. “In the early days of machine learning, we used to talk about thousands of data points,” Frey said. “Around 2005, we were excited about a million data points. Now, we’re not even talking about billions, but trillions of data points.”

Frey drew a comparison to OpenAI’s GPT-4, a trailblazing large language model rumored to have well over a trillion parameters. BigRNA is on track to reach a similar scale. “Biology is more complex than human discourse for several reasons: We don’t innately know the ‘language’ of biology; we’ve invented terminology and hypotheses about how it works,” Frey said. “Much of what’s happening in biology isn’t directly related to health and medicine.”

Unravelling such complexity requires significant computing resources. “We use GPU servers, and we have relationships with Google and Nvidia to get access to the hardware we need,” Frey said. “It’s all about GPUs and TPUs for this kind of work. It requires a huge investment of money to train these kinds of models. They’re not cheap to build.”

‘Totally different from what we saw a few years ago’

While bigger isn’t always better in machine learning, increasingly large models are pointing the way forward as more biotech companies embrace systems with hundreds of billions or even trillions of parameters, often with surprising results. The ability of such large models to develop unexpected capabilities while avoiding overfitting (where a model excels on training data but fails to generalize to new data) challenges long-held assumptions about model behavior.

“In the last couple of years, everything’s changed,” Frey said. “It turns out you can over-train the model, and yet it still generalizes really, really well,” he added, describing the phenomenon as “shocking.”

The rise of large foundation models challenges traditional assumptions in machine learning and statistics. Classical bias-variance reasoning held that a highly parameterized model might capture underlying patterns well (low bias) but would be oversensitive to random fluctuations in its training data (high variance), and would therefore generalize poorly.

To illustrate this concept, imagine a game of darts where machine learning models are analogous to players trying to hit the bullseye. A model with high bias would consistently miss in the same direction – like always throwing too far to the left. A model with high variance would have its darts scattered all over the board — sometimes close, sometimes far off. The goal was to find a sweet spot: throws that cluster tightly around the bullseye.
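A few lines of numpy make the darts analogy concrete: fit many noisy resamples of the same underlying function with a too-simple model and a very flexible one, then look at where each model’s “throws” land at a single test point. The setup below is a generic toy illustration, unrelated to BigRNA.

```python
# Bias vs. variance as darts: a toy numpy simulation (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)  # the unknown "ground truth"
x = np.linspace(0, 1, 20)
x0, bullseye = 0.25, 1.0  # true_f(0.25) == 1.0 is the bullseye

def throws(degree, n_trials=200):
    preds = []
    for _ in range(n_trials):
        y = true_f(x) + rng.normal(0, 0.3, x.size)  # a fresh noisy dataset
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    return preds.mean() - bullseye, preds.var()  # (bias, variance)

for deg in (1, 9):  # underparameterized vs. very flexible polynomial
    bias, var = throws(deg)
    print(f"degree {deg}: bias {bias:+.3f}, variance {var:.3f}")
```

With these settings, the straight line typically misses the bullseye in the same direction every time (high bias, low variance), while the degree-9 polynomial centers on the target but scatters its throws (low bias, higher variance).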

Now, picture a dart throw that initially veers off target before curving back to the bullseye, reminiscent of David Beckham’s arcing free kicks. Giant deep learning models seem to “bend” the rules of traditional machine learning in a similar way. They appear to be over-trained and initially head off-target, but somehow still manage to “hook” back in and generalize well to new data after more training.

“We haven’t fully figured it out, but this pertains to foundation models,” Frey said. “The key is you have to have a really big model for this to happen. If you have a small model, like just training linear regression, you’ll never need to worry about overfitting.”

Some pundits attribute this phenomenon to emergent intelligence. “It’s actually figuring out relationships that are truly an order or two of magnitude beyond what you would get using a traditional way of thinking about statistics and machine learning,” Frey said, while underscoring the mystery involved in the phenomenon, which is known as double descent.
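The effect Frey describes can be reproduced in a standard toy setting. In the random-features sketch below (a textbook demonstration, not anything specific to BigRNA), test error typically spikes when the number of features approaches the number of training examples, then descends again, the second descent, as the model grows far past that interpolation point.

```python
# Double descent in miniature: random-feature least squares (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 40, 500, 10
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + rng.normal(0, 0.5, n)  # noisy linear targets

Xtr, ytr = make_data(n_train)
Xte, yte = make_data(n_test)

for n_feat in (5, 20, 40, 80, 400):  # sweep capacity past n_train = 40
    W = rng.normal(size=(d, n_feat))  # fixed random projection
    feats = lambda X: np.tanh(X @ W)  # random nonlinear features
    # lstsq returns the minimum-norm fit in the overparameterized regime
    coef, *_ = np.linalg.lstsq(feats(Xtr), ytr, rcond=None)
    mse = np.mean((feats(Xte) @ coef - yte) ** 2)
    print(f"{n_feat:4d} features: test MSE {mse:.2f}")
```

Test error typically peaks near 40 features, the interpolation threshold, and falls again by 400: the signature double-descent curve.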

“Whatever it is that’s happening, we are seeing a qualitative change in how these machine learning methods work,” Frey said. “It’s totally different from what we saw a few years ago.”


Filed Under: machine learning and AI
Tagged With: AI in drug discovery, Deep Genomics, genetic variant analysis, Machine Learning in Pharma, oligonucleotide design, personalized therapeutics, RNA biology prediction
 

About The Author

Brian Buntz

As the pharma and biotech editor at WTWH Media, Brian has almost two decades of experience in B2B media, with a focus on healthcare and technology. While he has long maintained a keen interest in AI, more recently Brian has made data analysis a central focus, exploring tools ranging from NLP and clustering to predictive analytics.

Throughout his 18-year tenure, Brian has covered an array of life science topics, including clinical trials, medical devices, and drug discovery and development. Prior to WTWH, he held the title of content director at Informa, where he focused on topics such as connected devices, cybersecurity, AI and Industry 4.0. A dedicated decade at UBM saw Brian providing in-depth coverage of the medical device sector. Engage with Brian on LinkedIn or drop him an email at bbuntz@wtwhmedia.com.
