
Abstract representation of molecular structures and data processing for AI in drug discovery
Machine learning (ML) has become a cornerstone of modern drug discovery. Today, many companies are taking ML developments a step further using generative AI (GenAI) to search for molecules under specific constraints, such as solubility and patent status, and to optimize for desired properties like potential therapeutic success. In doing so, GenAI can enhance the efficiency, speed and creativity of the new drug discovery process.
Yet despite their potential to reshape the way scientists explore molecular spaces, ML and GenAI are only as good as the data behind them. In practice, many companies neglect fundamental data engineering steps when deploying ML and GenAI technologies.
Successful implementation of ML and GenAI in drug discovery thus requires an optimized approach to research informatics, one that merges data science practices into drug discovery pipelines to accelerate innovation.
The problems with unstructured data
Approaching AI-powered drug discovery without expertise in medicinal chemistry and understanding of assay data can hinder model development and performance. Without this knowledge, you can introduce data management inefficiencies that bias model outputs toward chemically implausible or suboptimal structures.
One example of this is improper handling of tautomer data. Tautomers, structural isomers of a chemical compound that readily interconvert, may look structurally different from each other, but they can be functionally equivalent. Two experienced chemists may treat the same pair of tautomers differently, and many chemical databases represent these tautomers as completely different compounds. This creates data management problems in which relevant information, such as measured properties, isn’t associated with the correct structure.
Failure to identify a canonical tautomer is another potential hurdle. Without a standardized canonical tautomer form, the dominant tautomeric form won’t be registered within the database, which can bias the AI model toward predicting less relevant compounds.
Properly addressing tautomer data has cost implications as well. Many research organizations request assays of multiple tautomeric forms that are functionally identical. This duplication of efforts directly contributes to increased project costs and timelines.
Data engineers won’t necessarily be aware of chemical distinctions at this level of depth, although some are. However, oversights like mishandled tautomer data can be avoided when the data engineers structuring chemical databases for AI are also chemistry specialists.
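The tautomer bookkeeping described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the `CANONICAL_TAUTOMER` lookup table, the record layout and the compound IDs are all assumptions made for the example, and a real pipeline would compute the canonical form with a cheminformatics toolkit rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical lookup from a registered SMILES to the tautomer chosen as canonical.
# A real pipeline would compute this with a cheminformatics toolkit.
CANONICAL_TAUTOMER = {
    "Oc1ccccn1": "O=c1cccc[nH]1",      # 2-hydroxypyridine -> 2-pyridone
    "O=c1cccc[nH]1": "O=c1cccc[nH]1",  # 2-pyridone is already the chosen form
}


def canonical_key(smiles):
    """Return the canonical tautomer for a SMILES, or the input if none is known."""
    return CANONICAL_TAUTOMER.get(smiles, smiles)


def group_by_tautomer(records):
    """Collect records that describe the same compound under one canonical key."""
    groups = defaultdict(list)
    for record in records:
        groups[canonical_key(record["smiles"])].append(record)
    return dict(groups)


# Illustrative records: the first two are tautomers of one compound.
records = [
    {"id": "CMP-1", "smiles": "Oc1ccccn1", "pIC50": 6.1},
    {"id": "CMP-2", "smiles": "O=c1cccc[nH]1", "pIC50": 6.3},
    {"id": "CMP-3", "smiles": "c1ccccc1", "pIC50": 4.0},
]
groups = group_by_tautomer(records)

# Any key holding more than one record points to a possible duplicate assay request.
duplicates = {key: [r["id"] for r in recs] for key, recs in groups.items() if len(recs) > 1}
print(duplicates)
```

Grouping on a canonical key keeps measured properties attached to one structure and exposes assay requests that were duplicated across tautomeric forms.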
Data processing for chemical research
Medicinal chemists and other professionals implementing ML and GenAI models often need to apply strategies to their drug discovery workflows that are outside their expertise. While an additional university degree in data science isn’t required, adopting methods from this field is extremely beneficial to optimizing pharmaceutical research informatics.
Domain-specific data management is fundamental to high-quality AI and ML implementations.
Successfully integrating GenAI with drug discovery involves the following key extract, transform, load (ETL) and processing tasks, among others.
Processing chemical structure and biological data
Since ML and GenAI models require standardized and structured data, it’s important to address the range of data types and formats involved in pharmaceutical research to create an AI-based cheminformatics and bioinformatics pipeline.
Case in point: while chemical formats like SMILES, InChI, SDF and MOL all describe molecular information, their different data types impede integration into one standardized structure. SMILES and InChI are one-dimensional text strings, for example, while SDF/MOL files are multi-line records that can carry atom coordinates. As a result, these formats can’t be ingested by AI models without conversion. Stereochemistry, 3D conformation, salts and tautomers are additional considerations to keep in mind.
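A first ETL step is often just working out which format an incoming record uses. The sketch below is a rough heuristic under stated assumptions: the function name `detect_format` and its return labels are invented for this example. It relies only on well-known markers (the `InChI=` prefix, the `M  END` line that closes a molfile, the `$$$$` record delimiter in SDF files and the `V3000` tag in the counts line). It does not validate the chemistry.

```python
def detect_format(text):
    """Guess the chemical file format of a record from its textual markers."""
    stripped = text.strip()
    if stripped.startswith("InChI="):
        return "inchi"
    if "$$$$" in text:
        # SDF files wrap one or more molfile blocks, each terminated by $$$$.
        return "sdf"
    if "M  END" in text:
        # Molfiles end with "M  END"; V3000 files say so in the counts line.
        return "mol-v3000" if "V3000" in text else "mol-v2000"
    if len(stripped.split()) == 1:
        # A single whitespace-free token is treated as a SMILES string.
        # This is a crude fallback and does not check that the string is valid.
        return "smiles"
    return "unknown"


print(detect_format("CCO"))
print(detect_format("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Once the format is known, each record can be routed to the appropriate parser and converted to a single internal representation.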
Biological data also requires processing. Although protein databases like PDB and UniProt are useful, they store information in different data types. Further, the bioassay information in databases like ChEMBL and PubChem typically contains inconsistent units to measure binding effectiveness, requiring normalization into a standard unit. Among the various databases, proteins also tend to have different IDs and naming conventions, which can be resolved by mapping names to each other.
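Unit normalization and name mapping of this kind can be sketched as follows. The function names and the alias table are assumptions made for illustration. Converting potency values to a negative log molar scale (for example pIC50) is a common convention, not something the databases mandate, and a real alias table would be built from a curated source rather than typed by hand.

```python
import math

# Conversion factors from common concentration units to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

# Illustrative alias table mapping protein names to a UniProt accession.
TARGET_ALIASES = {"EGFR": "P00533", "ERBB1": "P00533", "HER1": "P00533"}


def normalize_unit(unit):
    """Map the micro sign and Greek mu to a plain 'u' so lookups are consistent."""
    return unit.strip().replace("\u00b5", "u").replace("\u03bc", "u")


def to_p_value(value, unit):
    """Convert a potency such as an IC50 to its negative log10 molar value."""
    molar = value * UNIT_TO_MOLAR[normalize_unit(unit)]
    return -math.log10(molar)


def resolve_target(name):
    """Return the accession for a known alias, or None if the name is unmapped."""
    return TARGET_ALIASES.get(name.strip().upper())


print(round(to_p_value(10, "nM"), 2))   # 10 nM on the negative log molar scale
print(round(to_p_value(1, "µM"), 2))    # 1 µM on the same scale
print(resolve_target("ErbB1"))
```

Putting every measurement on one scale, and every target under one identifier, lets records from different sources be compared directly.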
Integrating text-based information
A variety of formats can represent text information, making it necessary to preprocess data before ingesting it into a large language model (LLM). Harmonizing information contained within publications, patents, notes, safety data sheets and clinical trials enables an LLM to work with the data.
LLMs require extensive data preprocessing to accommodate various text formats like PDFs, tables, images and labels. Tools for web scraping or optical character recognition can extract raw and structured text data for the LLM. After extraction, the data must be normalized into a structured format like a JSON file. Converting this information into numerical vectors that the LLM can “understand” enables structured reasoning to form relationships like drug interactions.
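The normalize-then-vectorize flow can be illustrated with a toy example. Everything here is an assumption for demonstration: the record fields, the function names and above all the hashed bag-of-words vector, which stands in for the learned embeddings an actual LLM pipeline would use.

```python
import hashlib
import json
import re


def normalize_record(source, raw_text):
    """Collapse whitespace from extracted text and wrap it in a structured record."""
    text = re.sub(r"\s+", " ", raw_text).strip()
    return {"source": source, "text": text}


def hashed_bag_of_words(text, dim=16):
    """Toy vectorizer: count tokens into a fixed number of hash buckets."""
    vector = [0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1
    return vector


# Text as it might come out of OCR, with stray line breaks and spacing.
raw = "Compound  A inhibits\nEGFR   in vitro.\n"
record = normalize_record("patent", raw)
record["vector"] = hashed_bag_of_words(record["text"])

# The record serializes cleanly to JSON for downstream loading.
print(json.dumps(record))
```

The same two steps apply whatever the extraction tool is: clean the text into a predictable structure first, then hand that structure to the model that produces the vectors.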
Machine learning and generative AI in drug discovery
While ML and GenAI in pharmaceutical research are closely linked, their specific applications differ slightly. ML models are often used to identify or predict compounds based on properties or performance, for example predicting a compound’s toxicity, binding affinity or patient response.
On the other hand, GenAI can create entirely new molecular structures based on learned characteristics of known compounds. Using the same example, GenAI could be used to generate novel structures with desired toxicities, binding affinities and patient responses.
ML models used in GenAI, including variational autoencoders (VAEs), generative adversarial networks (GANs) and recurrent neural networks (RNNs), can generate new molecules based on existing molecules. Input data can be images like chemical drawings depicting structure, or text-based data like SMILES strings.
Similarly, GenAI using these models could produce new chemical drawings and SMILES strings based on the data of known compounds. Keep in mind, GenAI architectures are built with ML components, so the same data input requirements apply.
Implementing cleanup and quality control
During this process, standard data processing quality checks, such as detecting and handling missing values in tabular data, still apply. Other procedures to implement include removing outliers and validating the correctness of chemical structure information. Because of the large volumes of data involved in drug discovery, optimizing data structures is key to speeding up GenAI and ML processing.
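Two of these checks, handling missing values and flagging outliers, can be sketched with the Python standard library. The function names and sample rows are assumptions for the example, and the modified z-score with a 3.5 cutoff is one common outlier rule among several, not a recommendation specific to assay data. Validating chemical structures is left out because it requires a cheminformatics toolkit.

```python
import statistics


def drop_missing(rows, field):
    """Keep only rows where the given field is present and not None."""
    return [row for row in rows if row.get(field) is not None]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median) exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All values sit at the median; nothing can be called an outlier.
        return [False] * len(values)
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


# Illustrative assay rows: one missing measurement and one suspicious value.
rows = [
    {"id": "CMP-1", "pIC50": 6.1},
    {"id": "CMP-2", "pIC50": 6.3},
    {"id": "CMP-3", "pIC50": None},
    {"id": "CMP-4", "pIC50": 5.9},
    {"id": "CMP-5", "pIC50": 6.0},
    {"id": "CMP-6", "pIC50": 6.2},
    {"id": "CMP-7", "pIC50": 12.0},
]
clean = drop_missing(rows, "pIC50")
flags = flag_outliers([row["pIC50"] for row in clean])
suspect = [row["id"] for row, is_outlier in zip(clean, flags) if is_outlier]
print(suspect)
```

Flagged values are best reviewed by someone who knows the assay before they are removed, since an extreme reading may be a genuine result.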
GenAI drug discovery in action
Drug discovery projects often involve multiple organizations sharing information such as assay data and synthesis requests. Integrating the disparate information systems into one centralized platform will streamline cooperation and reduce project timelines. Implementing ML and GenAI tools into this central system further amplifies these benefits.
Unfortunately, the incompatibility of each organization’s database and content may hinder integration into one consolidated structure. Designing a comprehensive data model that reconciles the differences between cooperating organizations provides a solution.
This unified data model (UDM) facilitates easy data-sharing among drug discovery partners and can even be designed to enable ML and GenAI use.
How we did it
Our specialty in streamlining pharmaceutical workflows for drug discovery allows us to optimize information systems that integrate with chemistry, pharmaceutical, ML and AI applications. One example is the BioChemUDM, which can represent compounds and assays for capturing, reporting and sharing biological and chemical data among pharmaceutical companies, regardless of the platforms used to manage chemical registration and assay data.
BioChemUDM registers compounds with a stereo-enhanced sketch according to specified drawing rules. For further categorization, this data model also applies string-based labels associated with the registered compounds. These labels, which inform users of useful information on molecules, batches, samples and assays, are easily parsed by commercial applications, reducing application integration complexity and time.
Because tautomerism plays an important role in drug performance, effectively managing tautomer data is key to designing a data model that is optimized for drug discovery. Within BioChemUDM, we addressed tautomers by normalizing tautomeric states according to SMIRKS patterns.
Assay data also deserves a close look. We organized assay information according to broad categories that group similar assays together. This allowed us to differentiate assays based on types of activation, binding, inhibition, oxidation and more. The assay identification strings contain fields for an assay’s test article, format, target, documentation and more.
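To show why string-based labels are easy for applications to parse, here is a sketch that reads a delimited assay label into named fields. The layout is purely hypothetical: the pipe delimiter, the field order, the class name `AssayLabel` and the sample label are all invented for illustration and are not the actual BioChemUDM syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayLabel:
    """Hypothetical field layout for an assay identification string."""
    category: str
    test_article: str
    assay_format: str
    target: str
    documentation: str


FIELD_COUNT = 5


def parse_assay_label(label, sep="|"):
    """Split a delimited label into named fields, rejecting malformed input."""
    parts = [part.strip() for part in label.split(sep)]
    if len(parts) != FIELD_COUNT or not all(parts):
        raise ValueError(f"expected {FIELD_COUNT} non-empty fields, got {parts!r}")
    return AssayLabel(*parts)


# Invented example label; real labels would follow the data model's own rules.
label = parse_assay_label("inhibition|small molecule|biochemical|EGFR|SOP-001")
print(label.category, label.target)
```

Because every field has a fixed position and name, a receiving application can group or filter assays without knowing anything about the system that produced the label.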
BioChemUDM doesn’t rely on a new standard of representing compounds. Instead, this data model uses the common V3000 format to register chemical structures according to their enhanced stereochemical connection table. Registration is done via sketches according to standardized drawing rules. By standardizing one identification format, BioChemUDM eliminates user-specific terms to describe stereochemistry. Instead, it relies entirely on chemical graph theory. This reliance enables organizations to register their individually siloed data into the UDM.
To date, multiple companies have adopted BioChemUDM and found its string labeling and sketch rules easy to apply to their data. Using this system, information becomes sharable between organizations within the same day.
Emphasize both chemistry and data expertise
The challenges of ML and AI implementation in drug discovery aren’t automatically resolved by applying data engineering practices. Given the widely varied types, formats and structures of chemical and biological data used in pharmaceutical research, it can be challenging to apply data science methodology effectively without drug discovery domain knowledge.
Without domain-specific knowledge of drug discovery, which spans chemistry, assay, and industry information, significant oversights can be made. Challenges are exacerbated when multiple organizations with siloed data systems must collaborate.
Workflow Informatics (WFI) is ideally suited for addressing the needs of drug discovery because of our experience as both data engineers and medicinal chemists. We draw from our decades of cheminformatics experience to design UDMs that provide drug discovery collaborators with AI and ML tools. Such approaches are applicable beyond pharmaceutical research as well, extending to disciplines such as materials science, agriculture and food science, and the aerospace and automotive industries.
To learn more, check out our paper on BioChemUDM.
Chris Lowden is the CEO of Workflow Informatics Corp.