Drug Discovery and Development

Moving beyond pilot purgatory: Scaling AI in drug discovery projects

By Brian Buntz | September 28, 2023

3D representation of the HTLV-1 intasome (a type of nucleoprotein complex) structure as visualized through cryo-electron microscopy. [Adobe Stock]

The topic of AI in drug discovery and development may continue to garner significant attention, but harnessing AI effectively requires a nuanced strategy and an ability to navigate between details and the bigger picture.

Andrew Anderson, vice president of innovation and informatics strategy at ACD/Labs, likens it to boiling water. “On a small scale, it’s straightforward. But boiling a tanker car full of water? That’s a different ballgame,” he said.

Thinking through these aspects is vital to address the ‘pilot-purgatory’ problem, which frequently arises in projects involving emerging technology. As a growing number of drug developers launch proof-of-concept AI projects, a significant number are struggling to scale them. “There’s a struggle with extensibility,” Anderson said. To counter the problem, many drug developers are shifting their focus to data standardization and data engineering efforts. “It’s essential to ensure that data acquired in the laboratory or the manufacturing plant is readily available to AI models,” Anderson said.

Drawing from Gartner analyst Doug Laney’s three V’s concept (data volume, velocity, and variety) introduced in 2001, this article delves into best practices for AI integration in drug discovery. These principles, used across industries, serve as a roadmap toward successful and scalable implementation of AI in drug discovery.

Dealing with velocity: When a trickle turns into a torrent

One hallmark of recent scientific advances — from genomics to mass spectrometry — is the quickening pace of data generation. The NIH’s National Human Genome Research Institute notes that computer systems will need an estimated 40 exabytes to house the genome-sequence data generated worldwide by 2025.

Other techniques that rapidly churn out data include next-generation sequencing (NGS), proteomics, molecular dynamics simulations, and cryo-EM. As Anderson points out, “One cryo-EM microscope can generate 10s if not 100s of datasets on a daily basis.”

To make sense of such a rapid influx of data, format and structure are key. Data needs to be structured so that it can be readily consumed by AI and ML models. Reformatting and standardizing this data comprehensively can be time-consuming and costly for mature pharmaceutical firms with substantial and fragmented data sets.

A significant number of companies are prepared to invest substantial sums in AI projects, according to recent data from CRB’s Horizons: Life Sciences 2023 report. One-fifth were planning on spending $10–$50 million over the next two years, while 49% planned on spending $1–$10 million.

Anticipating crushing data volumes in AI drug discovery projects

Closely related to the theme of data velocity is data volume. “A single cryo-EM measurement can produce data as large as 80 gigabytes,” Anderson said. “Given the rapid data generation capability of a cryo-EM microscope, you’re soon dealing with terabytes of data.”

Storing and moving such data volumes can prove challenging. “Transferring such vast amounts of data to an AI model on the cloud, whether it’s AWS, Google Cloud, or Azure, in a just-in-time manner presents a significant network challenge,” Anderson said.

Anderson cites anecdotes of labs that have resorted to shipping large data sets on a storage device by courier, a practice sometimes referred to as “sneakernet.” “It’s always interesting to note that, similar to the tanker car full of water analogy, here we have a tanker car full of data. And it sometimes seems easier to ship than to push through a hose,” he said.
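
To put rough numbers on that comparison, the sketch below estimates daily data volume and upload time. It takes Anderson's 80 GB figure as an upper bound per measurement and reads "10s if not 100s" as 10 and 100 datasets per day; the 1 Gbps link and the 80% usable-bandwidth factor are illustrative assumptions, not figures from the article.

```python
GB = 10**9  # decimal gigabyte, in bytes


def daily_volume_bytes(datasets_per_day: int, gb_per_dataset: float = 80.0) -> float:
    """Upper-bound daily volume, treating every dataset as the 80 GB worst case."""
    return datasets_per_day * gb_per_dataset * GB


def transfer_hours(total_bytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to push total_bytes through a link of link_gbps gigabits per second.

    efficiency is an assumed fraction of nominal bandwidth that is actually usable.
    """
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bytes * 8 / usable_bits_per_second / 3600


if __name__ == "__main__":
    for n in (10, 100):
        volume = daily_volume_bytes(n)
        hours = transfer_hours(volume, link_gbps=1.0)
        print(f"{n} datasets/day: {volume / 10**12:.1f} TB/day, "
              f"{hours:.1f} h on an assumed 1 Gbps link")
```

Under these assumptions, 100 worst-case datasets a day would take nearly a full day to upload, so a single instrument could saturate the link, which is the situation in which shipping a drive starts to look attractive.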

As outlined earlier, dealing with the challenge requires a robust approach to data engineering. This process involves reformatting, structuring, cleaning and standardizing data to ensure it’s readily consumable by AI and ML models. Industry experts and well-crafted automated tools can accelerate the process, but it often remains a network challenge.

Finding the signal in the noise

Extracting meaningful insights from vast data sets requires AI models that can sift through noise and find patterns. While acknowledging the importance of funding for data-driven drug discovery efforts, Carl Hansen, CEO of AbCellera, said that an organization’s size can sometimes pose a barrier to effective data integration and optimization. The ultimate value of the data in such cases is limited because it cannot be harnessed for actionable insights. “If you’re a big multinational pharma, you built your workflows organically over the years, and you started doing that before computers even existed,” he said. Attempting to align thousands of R&D professionals to use a new data system can be nearly impossible. “The change management problem there and the switching cost are so gigantic that it cannot be solved properly,” Hansen said. “Even if they’ve generated terabytes of data over the years, that data is not useful for learning because it’s not formatted in a standard way.”

“If you let people generate data, and you don’t enforce a standard, you get chaos. If you have chaos, you’ve lost the data. It’s not organized, and you’ve lost the ability to connect pieces and ask a computer to solve some problem,” Hansen said.

Data source: CRB’s 2023 Horizons: Life Sciences Report. The survey encompassed a diverse range of companies, from startups to established pharmaceutical and biotech firms across North America and Europe. 

Dealing with the data heterogeneity challenge

While massive curated data volumes can empower drug discovery and development, the avalanche of data generated by everything from cryo-EM microscopes to high-throughput screening platforms can also overwhelm. “The data from instruments and assays is often heterogeneous,” Anderson said. “You might have one format for one type of instrument and a different proprietary format for the same measurement technique.”

“We face a confluence of challenges,” Anderson said. “The lack of a consistent format based on instrumental data outputs, the heterogeneity of these formats, and the size and network issues associated with transferring data from the instrument to its ultimate destination.” Drug discovery professionals must aim to efficiently funnel training data into these models without incurring exorbitant costs.
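
One common way to tame format heterogeneity is to map each vendor's output onto a single shared record type before it reaches a model. The sketch below is a minimal illustration of that idea; the two vendor layouts, their field names, and the `Measurement` schema are all hypothetical and do not describe any real instrument or ACD/Labs product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical common schema that every instrument record is mapped onto."""
    instrument_id: str
    technique: str
    size_gb: float


def from_vendor_a(raw: dict) -> Measurement:
    # Hypothetical vendor A: flat record, size reported in megabytes.
    return Measurement(raw["InstrumentID"], raw["Technique"].lower(), raw["SizeMB"] / 1000)


def from_vendor_b(raw: dict) -> Measurement:
    # Hypothetical vendor B: nested metadata, size reported in bytes.
    meta = raw["meta"]
    return Measurement(meta["device"], meta["method"].lower(), raw["bytes"] / 1e9)


# Registry of parsers; supporting a new format means adding one entry here.
PARSERS = {"vendor_a": from_vendor_a, "vendor_b": from_vendor_b}


def normalize(source: str, raw: dict) -> Measurement:
    """Convert a raw record from a known source into the common schema."""
    try:
        parser = PARSERS[source]
    except KeyError:
        raise ValueError(f"no parser registered for source {source!r}") from None
    return parser(raw)
```

With this layout, the same measurement exported in either format normalizes to an identical `Measurement`, so downstream code never needs to know which vendor produced it.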

Challenges in AI integration and data management

Additionally, AI and ML models need to be able to digest and use the data they receive. Ensuring that outcome, however, is often not straightforward. “Our recent efforts have shifted from helping customers build new AI models to ensuring that training data reaches these AI models as economically as possible,” Anderson said.

Carl Hansen, CEO of AbCellera, recently commented on the complexities of data sharing in antibody drug discovery. The process involves setting up experiments to generate antibody diversity, then identifying and characterizing promising sequences, and finally engineering them for optimal performance. Research continues in animal models and with an in-depth characterization of biophysical properties. Imagine trying to track every detail with spreadsheets. “If you were to run around and follow the experiment with a spreadsheet, you would have to update information at every single step as to what was done, what was the plate, did that experiment work, who touched it?” Hansen said. “And at the end, you are trying to make a decision. And you have 300 people together because they are emailing spreadsheets to each other – they have 2,000 different spreadsheets, and they don’t match up.”

Ensuring data accuracy and reliability with AI in drug discovery

Ensuring data accuracy and reliability in drug discovery is not merely a best practice — it’s paramount. AI and machine learning models, which are playing a growing role in drug discovery and drug development, hinge on the quality of their training data. As the old adage goes, “garbage in, garbage out.”

Anderson refers to data as the “heartbeat” of chemistry R&D, whether it stems from physical experimentation or in silico calculations.

For instance, data-driven predictions based on the likely ionic form of a drug upon administration and distribution can shed light on how a molecule will behave in the body. Broadly speaking, ionization refers to whether parts of a drug molecule carry an electrical charge at a given pH. The body generally doesn’t absorb ionized drugs as efficiently as non-ionized drugs. “The body is buffered, right? Physiologic pH versus intestinal pH are two very different values,” Anderson said. Because pH differs between the stomach, the intestine and the blood, the ability of an oral drug to permeate across the intestinal membrane, for example, depends on its ionization state there. Studying a set of molecules to predict how a particular molecule will behave in the body can guide drug discovery strategies.
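
The dependence of ionization on pH can be illustrated with the standard Henderson-Hasselbalch relationship for a compound with a single ionizable group. This is a textbook approximation, not a description of how ACD/Labs or any specific tool predicts ionization; the pKa of 4.5 and the compartment pH values below are illustrative assumptions.

```python
def fraction_ionized(pka: float, ph: float, kind: str) -> float:
    """Fraction of a monoprotic compound in its ionized form at a given pH.

    Uses the Henderson-Hasselbalch relationship:
      weak acid: ionized fraction = 1 / (1 + 10**(pKa - pH))
      weak base: ionized fraction = 1 / (1 + 10**(pH - pKa))
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


if __name__ == "__main__":
    # Approximate, illustrative pH values; real values vary by person and state.
    compartments = {"stomach": 2.0, "small intestine": 6.5, "blood": 7.4}
    hypothetical_pka = 4.5  # assumed weak acid, for illustration only
    for name, ph in compartments.items():
        share = fraction_ionized(hypothetical_pka, ph, "acid")
        print(f"{name} (pH {ph}): {share:.1%} ionized")
```

For this hypothetical weak acid, the molecule is almost entirely neutral at stomach pH and almost entirely ionized at blood pH, which is why a model that ignores pH can misjudge absorption.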

Ionization predictions and data veracity

“Augmenting the models built using machine learning with ionization predictions can enhance their accuracy,” Anderson asserts. This accuracy results in more druggable chemical structures and a deeper understanding of structure-activity relationships (SAR). “If the SAR models don’t account for ionization at physiologic pH, you might optimize molecule derivatives in a less effective manner,” he added. Neglecting ionization can lead to discrepancies between model predictions and actual outcomes. In essence, the veracity of data and the accuracy of these predictions can shape the trajectory of drug discovery.

But data veracity can sometimes be elusive in drug discovery, given the complexity of research and the sheer variety of professionals and steps involved in the process spanning initial research to clinical trials. For instance, the prevalent use of spreadsheets to chronicle data across various stages of antibody drug discovery can result in an explosion of relative truths. With each spreadsheet being relative to its creator’s perspective, discerning a single truth becomes an uphill task. “You can’t compare the things and the handoff between the groups relies on a huge amount of information transfer,” Hansen said. The practice introduces “massive friction in the organization.”

Addressing this friction requires building a centralized ground truth for data. “You have to have software systems that ensure that everyone is delivering high-quality data in the right format, into the centralized part of the organization,” Hansen said.
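
One way to picture such a system is a validation gate in front of the central store: records that do not match the agreed format are rejected at submission rather than cleaned up later. The sketch below is a simplified illustration, not AbCellera's software; the field names are hypothetical, loosely modeled on the details Hansen lists (the plate, whether the experiment worked, who touched it).

```python
# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {"sample_id": str, "plate": str, "operator": str, "passed": bool}


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(record[name]).__name__}"
            )
    return problems


def ingest(store: dict, record: dict) -> None:
    """Add a record to the central store only if it passes validation."""
    problems = validate(record)
    if problems:
        raise ValueError("; ".join(problems))
    store[record["sample_id"]] = record
```

Because every group submits through the same gate, the store holds one record per sample in one format instead of many spreadsheets that "don't match up."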

Ultimately, data from across drug discovery efforts should flow from the lab to the manufacturing plant to further enhance AI models. Throughout the process, though, there is little room for error. “Biotech is about what’s true, and it’s about science,” Hansen concluded.



About The Author

Brian Buntz

As the pharma and biotech editor at WTWH Media, Brian has almost two decades of experience in B2B media, with a focus on healthcare and technology. While he has long maintained a keen interest in AI, more recently Brian has made data analysis a central focus, and is exploring tools ranging from NLP and clustering to predictive analytics.

Throughout his 18-year tenure, Brian has covered an array of life science topics, including clinical trials, medical devices, and drug discovery and development. Prior to WTWH, he held the title of content director at Informa, where he focused on topics such as connected devices, cybersecurity, AI and Industry 4.0. A dedicated decade at UBM saw Brian providing in-depth coverage of the medical device sector. Engage with Brian on LinkedIn or drop him an email at bbuntz@wtwhmedia.com.

Copyright © 2025 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media