
In drug discovery, the difference between a breakthrough and a breakdown can be as small as the twist of a bond. Stereochemistry, the three-dimensional arrangement of atoms, governs how otherwise identical molecules behave in the body: whether they bind the intended target, trigger off-targets, or get cleared in minutes rather than hours.

“Shape really does matter,” said Adam Sanford, Ph.D., senior product lead at scientific information firm CAS, a division of the American Chemical Society. “It could be an identical structure, but have a different handedness, so to speak. It might make the difference between it working at all and being extraordinarily effective.”
Yet records containing stereo information can be fragile or inconsistent. The data can disappear in scanned PDFs, hand-offs between lab notebooks and databases introduce transcription errors, and OCR software routinely mangles the symbols that encode three-dimensional orientation. File-format conversions between systems can also lose configurational details. Whatever the cause, the consequences are serious. “If you lose that stereo information, or it gets jumbled… a lot of the data… is not really actionable… [and] may lead you down a completely incorrect path,” Sanford said.
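A minimal sketch of what that loss looks like in data, assuming structures are stored as SMILES strings, where "@" and "@@" mark tetrahedral centers and "/" and "\" mark double-bond geometry. The strip_stereo helper is hypothetical and only imitates a lossy export at the string level; it does not reproduce how any particular converter behaves.

```python
import re

# Characters that carry stereo in SMILES: '@' / '@@' for tetrahedral
# centers, '/' and '\' for double-bond geometry.
STEREO_MARKS = re.compile(r"@+|[/\\]")


def strip_stereo(smiles):
    """Imitate a lossy export by deleting every stereo mark (hypothetical helper)."""
    return STEREO_MARKS.sub("", smiles)


# The two enantiomers of alanine differ only in '@' versus '@@'.
records = {
    "L-alanine": "C[C@H](N)C(=O)O",
    "D-alanine": "C[C@@H](N)C(=O)O",
}

flattened = {name: strip_stereo(smiles) for name, smiles in records.items()}

distinct_before = len(set(records.values()))
distinct_after = len(set(flattened.values()))
print(distinct_before, distinct_after)
```

Once the marks are gone, the two mirror-image molecules share one string, so any measurement attached to either can no longer be assigned to the right form.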
Systematic and random errors alike can ripple through computational workflows: mistakes in stereochemical representation can propagate into QSAR models, pharmacophore models, and docking experiments. The result can be misleading virtual screening output that hinders chemical design and drug discovery efforts.
The stakes are high for machine learning applications in drug discovery, where practitioners acknowledge that the predictive power of any ML approach depends on data quality. As a rule of thumb, the practice consists of at least 80% data processing and cleaning and only 20% algorithm application. Industry surveys have found that data preparation often dominates data-science effort, and reviews in drug discovery emphasize that high-quality curated data are a prerequisite for reliable ML. When training data lack accurate stereochemical information, the resulting noise and errors can compromise the reliability of AI models, affecting everything from virtual screening to property prediction. “In cases where we… train those algorithms on less sophisticated data, you see a direct impact in the overall efficacy of the… results,” Sanford said.
The stakes
The consequences of stereochemical oversights aren’t limited to prediction errors or wasted R&D dollars. The field learned that lesson with thalidomide, marketed as a sedative between 1957 and 1961, which caused severe birth defects in over 10,000 children. One enantiomer eased morning sickness; the other was teratogenic.

As Orr Ravitz, Ph.D., senior product manager on the CAS life sciences team, noted, “Administering the thalidomide in only… the unharmful form… doesn’t help, because in vivo, the two forms are inter-convertible… there’s no way to give this drug safely to pregnant women.”
This catastrophic failure forced the industry and regulators to confront the importance of stereochemistry. By the early 1990s, the FDA issued its Policy Statement for the Development of New Stereoisomeric Drugs, requiring that “the stereoisomeric composition of a drug with a chiral center should be known” and that sponsors demonstrate identity, strength, quality, and purity “from a stereochemical viewpoint.” As Ravitz put it, “If the drug candidate has a stereocenter, the FDA requires investigation of different stereoisomers.” The result: by the 2010s, the majority of new FDA approvals were single enantiomers rather than racemates, and companies developed robust analytical methods to characterize stereoisomers. Chiral drugs remain common in recent approvals: 20 of the 35 novel FDA approvals in 2020 were chiral, reinforcing why stereo-aware data matters operationally.
The new vulnerability
But the rise of computational and AI-driven discovery has created a new vulnerability across the industry. In many workflows, machine learning models now ingest thousands of structures automatically without human review, propagating any stereochemical inconsistencies directly into predictions. Organizations that maintain human validation and curation can catch these errors before they corrupt downstream analyses. As one study notes, stereochemistry affects drug-receptor binding, metabolism, and toxicity.
“All the time. Happens all the time,” Sanford said when asked if stereo errors occur in practice. “There are all these points where errors can be introduced.” Those points span every hand-off: electronic lab notebooks to PDFs, PDFs to databases, databases to training sets. The failure modes include missed wedges and dashes in figures, OCR damage, file-format conversions, and manual transpositions. Each transition introduces opportunities for stereochemical information to degrade or disappear entirely.
Recent research found that multiple deep-learning docking methods produced poses with physically implausible ligand configurations, including wrong stereochemistry, even when other accuracy metrics looked acceptable. The models appeared to work, until you examined the actual three-dimensional geometry they predicted.
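One common way to examine that geometry is a signed-volume (chiral volume) check: the sign of the scalar triple product of a stereocenter's four substituent positions flips when the arrangement is mirrored. The sketch below is illustrative only. The coordinates are made up, the function names are hypothetical, and it is not the procedure used in the research described above. It assumes the four substituents are listed in the same order for both structures.

```python
def signed_volume(a, b, c, d):
    """Six times the signed volume of the tetrahedron with corners a, b, c, d."""
    u = [a[i] - d[i] for i in range(3)]
    v = [b[i] - d[i] for i in range(3)]
    w = [c[i] - d[i] for i in range(3)]
    # Scalar triple product u . (v x w); its sign encodes handedness.
    return (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )


def same_handedness(reference, predicted):
    """True when both sets of substituent positions have the same chirality sign."""
    return signed_volume(*reference) * signed_volume(*predicted) > 0


# Made-up substituent positions around a stereocenter at the origin.
reference = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
# A "predicted pose" that is the mirror image (x coordinate negated).
mirrored = [(-x, y, z) for x, y, z in reference]

print(same_handedness(reference, reference))
print(same_handedness(reference, mirrored))
```

A pose can sit close to the reference by distance-based metrics and still fail this test, which is why the check looks at the sign rather than the size of the volume.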
On the flip side, when data is clean, new capabilities emerge. “Once we got data that was very comprehensive and reliable in terms of stereochemistry, suddenly the ability to derive stereospecific reaction rules became realistic,” Ravitz said.
The path forward
Given the risks involved, Sanford argues teams should treat chirality as an operational problem: establish standards, define specifications, implement tests. Feed design, synthesis and modeling with stereo-correct, identifier-harmonized data from the moment hits emerge.
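As a rough illustration of what "implement tests" could mean in practice, the sketch below compares the stereo marks in each record before and after a hand-off between two systems. It assumes both systems store the same SMILES string verbatim, keyed by a shared compound ID; the IDs, the find_stereo_drift name, and the sample data are all hypothetical. A production pipeline would more likely compare canonical isomeric identifiers generated by a cheminformatics toolkit, since equivalent SMILES can be written in more than one way.

```python
import re

# SMILES stereo tokens: '@@' or '@' (tetrahedral), '/' or '\' (double bond).
STEREO_TOKEN = re.compile(r"@@|@|/|\\")


def stereo_tokens(smiles):
    """Return the stereo tokens of a SMILES string, in order of appearance."""
    return STEREO_TOKEN.findall(smiles)


def find_stereo_drift(source, target):
    """Map compound ID -> problem for records whose stereo marks did not survive."""
    problems = {}
    for compound_id, src in source.items():
        dst = target.get(compound_id)
        if dst is None:
            problems[compound_id] = "missing"
            continue
        before, after = stereo_tokens(src), stereo_tokens(dst)
        if len(after) < len(before):
            problems[compound_id] = "stereo lost"
        elif before != after:
            problems[compound_id] = "stereo changed"
    return problems


# Hypothetical records before and after a database hand-off.
source = {"CMPD-1": "C[C@H](N)C(=O)O", "CMPD-2": "F/C=C/F", "CMPD-3": "CCO"}
target = {"CMPD-1": "C[CH](N)C(=O)O", "CMPD-2": "F/C=C\\F", "CMPD-3": "CCO"}

report = find_stereo_drift(source, target)
print(report)
```

Run as a gate at each transfer point, a check like this turns a silent loss into a visible failure: here the first record has lost its chiral center and the second has flipped from trans to cis, while the achiral third record passes.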
“One of the things that’s really important for the early-stage drug discovery scientist is, what’s going to happen when I introduce that therapeutic into the body?” Sanford said. “In many cases, the active part of the therapeutic isn’t what you introduce; it’s what your body does to it.” Stereochemistry governs all of it: how compounds are metabolized, how long they persist, which pathways they activate.
As one review notes, “the availability of high-quality, accurate and curated data in large quantities” remains a fundamental challenge in applying machine learning to drug discovery. For AI-driven discovery to deliver on its promise, that challenge starts with getting the stereochemistry right.



