
3D representation of the HTLV-1 intasome (a type of nucleoprotein complex) structure as visualized through cryo-electron microscopy. [Adobe Stock]
Andrew Anderson, vice president of innovation and informatics strategy at ACD/Labs, likens the challenge of scaling AI in drug discovery to boiling water. “On a small scale, it’s straightforward. But boiling a tanker car full of water? That’s a different ballgame,” he said.
Thinking through these aspects is vital to address the ‘pilot-purgatory’ problem, which frequently arises in projects involving emerging technology. As a growing number of drug developers launch proof-of-concept AI projects, a significant number are struggling to scale them. “There’s a struggle with extensibility,” Anderson said. To counter the problem, many drug developers are shifting their focus to data standardization and data engineering efforts. “It’s essential to ensure that data acquired in the laboratory or the manufacturing plant is readily available to AI models,” Anderson said.
Drawing from Gartner analyst Doug Laney’s three V’s concept (data volume, velocity, and variety) introduced in 2001, this article delves into best practices for AI integration in drug discovery. These principles, used across industries, serve as a roadmap toward successful and scalable implementation of AI in drug discovery.
Dealing with velocity: When a trickle turns into a torrent
One hallmark of recent scientific advances — from genomics to mass spectrometry — is the quickening pace of data generation. The NIH’s National Human Genome Research Institute notes that computer systems will need an estimated 40 exabytes of storage capacity to house the genome-sequence data generated worldwide by 2025.
Other techniques that rapidly churn out data include next-generation sequencing (NGS), proteomics, and molecular dynamics simulations, alongside cryo-electron microscopy (cryo-EM). As Anderson points out, “One cryo-EM microscope can generate tens, if not hundreds, of datasets on a daily basis.”
To make sense of such a rapid influx of data, format and structure are key. Data needs to be structured so that it can be readily consumed by AI and ML models. Reformatting and standardizing this data in a comprehensive manner can be time-consuming and costly, especially for mature pharmaceutical firms with substantial and fragmented data sets.
A significant share of companies are prepared to invest substantial sums in AI projects, according to recent data from CRB’s Horizons Life Sciences 2023 report. One-fifth were planning on spending $10–$50 million over the next two years, while 49% planned on spending $1–$10 million.
Anticipating crushing data volumes in AI drug discovery projects
Closely related to the theme of data velocity is data volume. “A single cryo-EM measurement can produce data as large as 80 gigabytes,” Anderson said. “Given the rapid data generation capability of a cryo-EM microscope, you’re soon dealing with terabytes of data.”
Storing such data volumes can prove challenging. “Transferring such vast amounts of data to an AI model on the cloud, whether it’s AWS, Google Cloud, or Azure, in a just-in-time manner presents a significant network challenge,” Anderson said.
Anderson notes anecdotes of labs that have resorted to the practice sometimes referred to as “sneakernet”: shipping large data sets on a physical storage device with a courier. “It’s always interesting to note that, similar to the tanker car full of water analogy, here we have a tanker car full of data. And it sometimes seems easier to ship than to push through a hose,” he said.
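To put that tradeoff in rough numbers, the back-of-envelope sketch below compares a network transfer with an overnight courier. The data-set size, link speed, and courier time are illustrative assumptions, not figures from Anderson or ACD/Labs.

```python
# Back-of-envelope comparison: network transfer vs. shipping a drive ("sneakernet").
# All figures below are illustrative assumptions, not measured values.

dataset_tb = 20        # assumed size of an accumulated cryo-EM data set, in terabytes
link_gbps = 1.0        # assumed sustained network throughput
courier_hours = 24     # assumed overnight courier time for a physical drive

dataset_bits = dataset_tb * 1e12 * 8                        # terabytes -> bits
network_hours = dataset_bits / (link_gbps * 1e9) / 3600     # bits / (bits per second) -> hours

print(f"Network transfer: ~{network_hours:.0f} hours")      # ~44 hours under these assumptions
print(f"Courier ('sneakernet'): ~{courier_hours} hours")
```

At tens of terabytes and a shared 1 Gbps link, the courier wins; the crossover point shifts as the assumed bandwidth changes.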
As outlined earlier, dealing with the challenge requires a robust approach to data engineering. This process involves reformatting, structuring, cleaning and standardizing data to ensure it’s readily consumable by AI and ML models. Industry experts and well-crafted automated tools can accelerate the process, but moving the resulting data to where the models run often remains a network challenge.
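As a sketch of what that data engineering step can look like, the example below harmonizes two hypothetical instrument exports into one analysis-ready table. The file names, column layouts, and units are invented for illustration; real instrument formats vary widely.

```python
import pandas as pd

# Hypothetical exports from two instruments reporting the same measurement
# in different proprietary layouts.
vendor_a = pd.read_csv("vendor_a_export.csv")    # columns: sample, conc_ug_ml, ts
vendor_b = pd.read_csv("vendor_b_export.csv")    # columns: SampleID, Concentration_mg_L, Timestamp

# Map each export onto one agreed-upon schema.
a = vendor_a.rename(columns={"sample": "sample_id",
                             "conc_ug_ml": "concentration_mg_l",
                             "ts": "measured_at"})
b = vendor_b.rename(columns={"SampleID": "sample_id",
                             "Concentration_mg_L": "concentration_mg_l",
                             "Timestamp": "measured_at"})

# µg/mL and mg/L are numerically equivalent, so no unit conversion is needed in this example.
standardized = pd.concat([a, b], ignore_index=True)
standardized["measured_at"] = pd.to_datetime(standardized["measured_at"])
standardized = standardized.dropna(subset=["sample_id", "concentration_mg_l"])

# A single, consistent format that downstream models can consume.
standardized.to_parquet("standardized_measurements.parquet")
```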
Finding the signal in the noise
Extracting meaningful insights from vast data sets requires AI models that can sift through noise and find patterns. While acknowledging the importance of funding for data-driven drug discovery efforts, Carl Hansen, CEO of AbCellera, said that size can sometimes pose a barrier to effective data integration and optimization. The ultimate value of the data in such cases is limited because it cannot be harnessed for actionable insights. “If you’re a big multinational pharma, you built your workflows organically over the years, and you started doing that before computers even existed,” he said. Attempting to align thousands of R&D professionals to use a new data system can be nearly impossible. “The change management problem there and the switching cost are so gigantic that it cannot be solved properly,” Hansen said. “Even if they’ve generated terabytes of data over the years, that data is not useful for learning because it’s not formatted in a standard way.”
“If you let people generate data, and you don’t enforce a standard, you get chaos. If you have chaos, you’ve lost the data. It’s not organized, and you’ve lost the ability to connect pieces and ask a computer to solve some problem,” Hansen said.
Data source: CRB’s 2023 Horizons: Life Sciences Report. The survey encompassed a diverse range of companies, from startups to established pharmaceutical and biotech firms across North America and Europe.
Dealing with the data heterogeneity challenge
While massive curated data volumes can empower drug discovery and development, the avalanche of data generated by everything from cryo-EM microscopes to high-throughput screening platforms can also overwhelm. “The data from instruments and assays is often heterogeneous,” Anderson said. “You might have one format for one type of instrument and a different proprietary format for the same measurement technique.”
“We face a confluence of challenges,” Anderson said. “The lack of a consistent format based on instrumental data outputs, the heterogeneity of these formats, and the size and network issues associated with transferring data from the instrument to its ultimate destination.” Drug discovery professionals must aim to efficiently funnel training data into these models without incurring exorbitant costs.
Challenges in AI integration and data management
Additionally, AI and ML models need to be able to digest and use the data they receive. Ensuring that outcome, however, is often not straightforward. “Our recent efforts have shifted from helping customers build new AI models to ensuring that training data reaches these AI models as economically as possible,” Anderson said.
Hansen recently commented on the complexities of data sharing in antibody drug discovery. The process involves setting up experiments to generate antibody diversity, then identifying and characterizing promising sequences, and finally engineering them for optimal performance. Research continues in animal models and with an in-depth characterization of biophysical properties. Imagine trying to track every detail with spreadsheets. “If you were to run around and follow the experiment with a spreadsheet, you would have to update information at every single step as to what was done, what was the plate, did that experiment work, who touched it?” Hansen said. “And at the end, you are trying to make a decision. And you have 300 people together because they are emailing spreadsheets to each other – they have 2,000 different spreadsheets, and they don’t match up.”
Ensuring data accuracy and reliability with AI in drug discovery
Ensuring data accuracy and reliability in drug discovery is not merely a best practice — it’s paramount. AI and machine learning models, which are playing a growing role in drug discovery and drug development, hinge on the quality of their training data. As the old adage goes, “garbage in, garbage out.”
Anderson refers to data as the “heartbeat” of chemistry R&D, whether it stems from physical experimentation or in silico calculations.
For instance, data-driven predictions of the likely ionic form of a drug upon administration and distribution can shed light on how a molecule will behave in the body. Broadly speaking, ionization refers to whether a molecule carries an electrical charge at a given pH. The body doesn’t absorb ionized drugs as efficiently as non-ionized drugs. “The body is buffered, right? Physiologic pH versus intestinal pH are two very different values,” Anderson said. Because the stomach is more acidic than blood, an oral drug’s ability to permeate across the intestinal membrane, for example, is determined by its ionization state. Studying a set of molecules to predict how a particular one will behave in the body can guide drug discovery strategies.
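The pH dependence Anderson describes can be made concrete with the Henderson-Hasselbalch relationship. The short sketch below assumes an illustrative weak-acid drug with a pKa of 3.5 and compares the ionized fraction at stomach pH (~1.5) and blood pH (7.4).

```python
def fraction_ionized_weak_acid(pka: float, ph: float) -> float:
    """Henderson-Hasselbalch: fraction of a weak acid in its ionized (deprotonated) form."""
    return 1.0 / (1.0 + 10 ** (pka - ph))

pka = 3.5  # illustrative value for a hypothetical weak-acid drug

for label, ph in [("stomach (pH 1.5)", 1.5), ("blood (pH 7.4)", 7.4)]:
    frac = fraction_ionized_weak_acid(pka, ph)
    print(f"{label}: {frac:.1%} ionized")

# stomach (pH 1.5): ~1% ionized   -> mostly neutral, better able to cross membranes
# blood (pH 7.4):  ~100% ionized  -> the charged form dominates once absorbed
```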
Ionization predictions and data veracity
“Augmenting the models built using machine learning with ionization predictions can enhance their accuracy,” Anderson said. This accuracy results in more druggable chemical structures and a deeper understanding of structure-activity relationships (SAR). “If the SAR models don’t account for ionization at physiologic pH, you might optimize molecule derivatives in a less effective manner,” he added. Overlooking ionization can lead to discrepancies between model predictions and actual outcomes. In essence, the veracity of data and the accuracy of these predictions can shape the trajectory of drug discovery.
But data veracity can sometimes be elusive in drug discovery, given the complexity of research and the sheer variety of professionals and steps involved in the process, spanning initial research to clinical trials. For instance, the prevalent use of spreadsheets to chronicle data across various stages of antibody drug discovery can result in an explosion of relative truths. With each spreadsheet reflecting its creator’s perspective, discerning a single truth becomes an uphill task. “You can’t compare the things and the handoff between the groups relies on a huge amount of information transfer,” Hansen said. The practice introduces “massive friction in the organization.”
Addressing this friction requires building a centralized ground truth for data. “You have to have software systems that ensure that everyone is delivering high-quality data in the right format, into the centralized part of the organization,” Hansen said.
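In practice, such a system can be as simple as a validation gate that rejects records that do not match the agreed schema before they ever reach the central store. The sketch below is a minimal, hypothetical example; the required fields and types are invented for illustration.

```python
# Hypothetical schema for one assay record entering a centralized data store.
REQUIRED_FIELDS = {"sample_id": str, "assay": str, "value": float, "operator": str}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record can be accepted."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be of type {expected_type.__name__}")
    return problems

# A record where the numeric result arrives as text, as often happens with spreadsheets.
record = {"sample_id": "AB-0042", "assay": "binding_affinity", "value": "12.7", "operator": "jdoe"}

issues = validate_record(record)
print("rejected:" if issues else "accepted", issues)  # rejected before it can pollute the central store
```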
Ultimately, data from across drug discovery efforts should flow from the lab to the manufacturing plant to further enhance AI models. Throughout the process though, there is little room for error. “Biotech is about what’s true, and it’s about science,” Hansen concluded.