Prakash explained there is simply “not enough data” to rely on AI as the primary means of small molecule drug discovery. He discussed these challenges further in an article in American Pharmaceutical Review titled “Exploring New Chemical Space for the Treatments of Tomorrow.”
The limitations and challenges of AI-first methods
Even with high-throughput screening to automate the process of testing pre-synthesized drug candidates against disease-associated target proteins, the pharma industry has managed to generate experimental data for fewer than ten million distinct chemicals out of a billion trillion trillion (10^33) unique drug-like compounds synthesizable under the rules of organic chemistry. And “AI-first” methods can only be trained on that existing experimental data. It is like exploring a drop of water in an ocean, he said.
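As a back-of-the-envelope check of the scale described above, the sketch below takes ten million as an upper bound on the number of tested compounds and a billion trillion trillion as the size of drug-like chemical space. The variable names are illustrative and not from the article.

```python
from fractions import Fraction

# Figures quoted in the article; "fewer than ten million" is treated as an upper bound.
tested_compounds = 10**7
# A billion (10^9) times a trillion (10^12) times a trillion (10^12).
drug_like_space = 10**9 * 10**12 * 10**12

# Exact ratio of explored to possible compounds, then expressed as a percentage.
explored_fraction = Fraction(tested_compounds, drug_like_space)
explored_percent = explored_fraction * 100

print(f"Explored fraction: {float(explored_fraction):.0e}")
print(f"Explored percent:  {float(explored_percent):.0e} %")
```

Under these assumptions the explored share comes to at most about one part in 10^26, or roughly 10^-24 percent of the space.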
In addition, even within that limited available experimental data, a sizable portion is of questionable quality and often not reproducible, he said. A 2022 Patterns article and a preprint on arXiv reached similar conclusions regarding the need for high-quality data. Those articles also cited two further obstacles: the lack of interoperability and the curse of dimensionality. The latter refers to the fact that machine learning models need large amounts of data to learn and predict accurately, and as the dimensionality (number of features) grows, the amount of data they require explodes.
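The growth described above can be illustrated with a toy calculation. It rests on a simplifying assumption that is not from the article: each feature is split into a fixed number of bins, and a model needs at least one data point in every cell of the resulting grid. The function name and the choice of 10 bins are hypothetical.

```python
def samples_for_grid_coverage(bins_per_feature: int, n_features: int) -> int:
    """Number of grid cells, and so the minimum number of samples,
    needed to place one data point in every cell."""
    return bins_per_feature ** n_features


# With 10 bins per feature, each added feature multiplies the requirement by 10.
for n_features in (1, 2, 5, 10, 20):
    needed = samples_for_grid_coverage(10, n_features)
    print(f"{n_features:>2} features -> {needed:.0e} samples")
```

Real models do not need full grid coverage, so this overstates the requirement, but it shows why data needs grow exponentially with the number of features.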
Another of Prakash’s core points is the lack of a particular data type critical for machine learning: negative data from lab experiments and clinical trials. In drug discovery, “failures” are published far less often than positive findings. “Negative data is as important as positive data for training an ML model,” he said.
AlphaFold: A game of evolutionary guesswork
It is tempting to believe that AI’s success in other areas will translate into success in drug discovery, especially when considering recent advances in the field. One of the most noteworthy AI tools to emerge in recent years is AlphaFold, a groundbreaking tool from DeepMind, which can accurately predict protein folding. Before AlphaFold debuted in 2021, researchers relied on The Worldwide Protein Data Bank (PDB) to provide experimentally determined protein structures for about 90% of disease programs. AlphaFold augments the PDB data by predicting structures for those proteins where the structure was previously unknown.
But the application of AI in such a complex field is not without its challenges. The effectiveness of AI is partly a function of the quality and size of training sets. DeepMind designed AlphaFold to predict previously unknown protein structures, but it did not leave the model to operate without sufficient foundational data. Several large databases offered vast numbers of known protein structures and their amino acid sequences across multiple species. This wealth of information, combined with various evolutionary rules that preserve the structure and function of proteins across species, provided a robust training ground for AlphaFold, playing a central role in its success, Prakash noted.
Drug discovery is different from protein folding in ways that make it a far more daunting challenge for AI. Knowing a protein’s structure is a first step. But discovering new drugs requires understanding how that protein will bind to a novel drug structure. “But there is no binding data for truly novel compounds,” Prakash pointed out. And tools like AlphaFold cannot make these predictions.
Understanding the challenges inherent in small molecule drug discovery
Prakash provides an eye-opening statistic to illustrate the current limitations of AI in drug discovery, estimating that we only have available data on 0.000000000000000000000001% of the drug-like chemical universe. “It is virtually impossible for current AI approaches to find breakthrough novel drugs unaided,” he explained. “And when you look at the companies who purportedly ‘discovered’ drugs with AI, you find that most have developed small ‘me-too’ modifications of well-trodden drug structures.”
Given the prohibitive time and cost involved in experimentally generating binding data for these novel drug structures, researchers must computationally derive the required data by simulating protein-drug interactions using highly accurate molecular-physics models. “Available garden-variety physics models won’t do,” Prakash said. Scientists need further advances in physics simulations to generate the data required to train next-generation AI models. Follow-up experimental data on promising novel structures can then further enhance these AI models, allowing researchers to systematically design and optimize novel drugs.
Ultimately, AI is one tool of many, not a cure-all, Prakash concluded. Thoughtfully integrating advances in AI, physics, chemistry and biology is required to explore the ‘uncharted chemical ocean’ of potential new small molecule drugs. “Small molecule drug discovery requires progress across diverse fields and smart application of the resulting integrated tools,” he noted.
“AI is not magical,” Prakash asserted. “We must understand where it’s valuable and where it’s not.”
Reflecting on the future of drug discovery in an era of fast-moving technology, Prakash said, “If you don’t use AI, you’ll be left behind. But the key to success is building and using all the other complementary tools required to generate the required high-quality data. AI acts on data.”