Speaking of journals, a growing number of papers are documenting the problem. A study published in the Journal of Medical Internet Research (2024), for instance, reported hallucination rates across major language models ranging from 28.6% for GPT-4 to 91.3% for Google’s Bard (now Gemini). Meanwhile, an arXiv preprint systematically analyzed 333 unique mentions of AI hallucination across 14 academic databases, discovering significant inconsistencies in how researchers conceptualize and measure the phenomenon.
Reducing science friction
In any event, the practical risks for scientific applications are considerable, especially in fields like drug development where accuracy is paramount. Elsevier’s “Insights 2024: Attitudes toward AI” study quantifies this concern: 86% of researchers worry AI will “cause critical errors or mishaps,” and about seven in ten respondents (71%) expect generative AI tools’ results to be based solely on high-quality, trusted sources.
Retrieval Augmented Generation (RAG) offers a path to greater trustworthiness and accuracy, and a way past the black-box dynamic of off-the-shelf chatbots. In essence, RAG grounds a generative AI model in an external data source (typically a vector database). The approach lets generative models cite their sources, helping humans trace the provenance of the data behind a model’s response. “You need explainability and the ability to validate hypotheses to avoid critical errors,” said Joe Mullen, director of data science and professional services for SciBite at Elsevier.
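To make the pattern concrete, here is a minimal, self-contained Python sketch of a RAG loop. The bag-of-words “embedding,” the toy corpus, and the document ids are illustrative stand-ins; a production system would use a learned embedding model and a real vector database.

```python
from collections import Counter
import math

# Toy corpus: document id -> text. A real system would hold embeddings
# in a vector database rather than raw text in a dict.
DOCS = {
    "elsevier-2024": "86% of researchers worry AI will cause critical errors or mishaps.",
    "jmir-2024": "Hallucination rates vary widely across major language models.",
}

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; stands in for a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    q = embed(query)
    ranked = sorted(DOCS.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str) -> str:
    # Ground the model in retrieved context and require citations by id,
    # so the provenance of the answer stays visible to a human reviewer.
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query))
    return (
        "Answer using ONLY the sources below, citing them by id.\n"
        f"{context}\n"
        f"Question: {query}"
    )

print(build_grounded_prompt("How worried are researchers about critical errors?"))
```

Because the retrieved passages carry their source ids into the prompt, the model’s answer can point back to specific documents, which is what makes provenance checkable by a human.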
Precision engineering for reliable AI
Implementing RAG effectively requires precise technical controls and human oversight. “You can’t just rely on efforts from commoditized providers,” Mullen said. The need for explainability and reliability aligns with the Elsevier survey finding cited above: 71% of respondents expect results drawn from high-quality, trusted sources.
A robust evaluation framework must incorporate quantitative metrics—like precision and recall in information retrieval—and qualitative assessments from subject matter experts. “You need to have humans in the loop to evaluate how performant these systems are against real scientific problems,” Mullen said. “We understand the importance of marrying the content, technology, and expertise.” The goal is to ensure proper expert oversight to fine-tune these AI systems. In drug development, this approach helps “improve the efficiency of the entire drug development process, whether that’s preclinical to clinical, and also post-marketing surveillance,” Mullen added.
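On the quantitative side, retrieval quality is typically scored against expert relevance judgments. A small sketch, assuming a gold set of relevant documents per query (the document ids are illustrative):

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    # retrieved: ranked document ids from the retrieval step.
    # relevant: expert-judged relevant ids for the query (the "gold" set).
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative query: the system returned four documents; subject matter
# experts marked two documents as truly relevant.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d3", "d5"}
p, r = precision_recall_at_k(retrieved, relevant, k=4)
print(f"precision@4 = {p:.2f}, recall@4 = {r:.2f}")  # 0.25 and 0.50
```

The gold set is exactly where the subject matter experts come in: the metrics are only as trustworthy as the human relevance judgments behind them.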
There’s a balancing act involved in integrating both internal and external information sources. Data architectures need to be flexible enough to segregate or combine information based on specific requirements. “There are scenarios where you want to keep data sets separate,” Mullen noted, describing how companies might need to query proprietary and public datasets independently. At the same time, the optimal application of AI in research relies on the integration of reliable scientific data, encompassing both high-quality external sources and internal datasets within secure computational frameworks.
SciBite, for instance, can handle data from different sources by treating them uniformly. “For example, if you enrich both data sources with the same ontologies or vocabularies, you can integrate them using our technology components,” Mullen said. “As long as you can ensure the data’s accuracy and treat data from different sources consistently—verifying it, for example—you can retrieve and use that data in the same manner, regardless of its origin.”
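A rough sketch of that enrichment idea, with a tiny synonym map standing in for a curated ontology (the concept ids, records, and helper functions are illustrative, not SciBite’s implementation):

```python
# Illustrative synonym map standing in for a curated ontology: every
# surface form resolves to one canonical concept id.
ONTOLOGY = {
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
    "paracetamol": "CHEBI:46195",
}

def enrich(record: dict) -> dict:
    # Annotate a record with the canonical concepts its text mentions.
    text = record["text"].lower()
    concepts = {cid for term, cid in ONTOLOGY.items() if term in text}
    return {**record, "concepts": sorted(concepts)}

internal = enrich({"source": "internal", "text": "Assay results for acetylsalicylic acid"})
public = enrich({"source": "public", "text": "Aspirin pharmacokinetics review"})

def query_by_concept(records: list[dict], concept_id: str) -> list[dict]:
    # Once enriched with the same vocabulary, records from any source
    # can be retrieved with one ontology-aware query.
    return [r for r in records if concept_id in r["concepts"]]

print(query_by_concept([internal, public], "CHEBI:15365"))  # both records match
```

Here the internal assay record and the public review resolve to the same canonical concept even though they use different names for the drug, which is what lets proprietary and public datasets be queried in the same manner.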
Reproducibility in the era of stochastic AI
The scientific community has long grappled with reproducibility issues. A landmark 2016 survey in Nature revealed that over 70% of scientists had failed to reproduce other researchers’ experiments, with more than half unable to replicate their own work. This challenge persists today and is compounded by the inherent variability in generative AI models. “By its nature, gen AI is stochastic,” Mullen notes. “Asking it the same question multiple times might yield slightly different responses.”
This variability introduces an “interesting reproducibility issue” in both traditional experimental methods and AI-assisted research processes. To address this, precise engineering controls and guardrails are essential. “You need guardrails in place to ensure you’re applying it where it makes sense—for example, converting natural language to syntax for the information retrieval step of a RAG system, or at the end step where you say, ‘This is the context; provide an answer from the content I’ve returned.'”
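A minimal sketch of those two guardrails, with an illustrative field whitelist and prompt template (the fields, regex, and wording are assumptions for demonstration, not a description of any specific product):

```python
import re

# Guardrail 1: map free text onto whitelisted, structured query syntax
# instead of letting the model improvise the retrieval step.
ALLOWED_FIELDS = ("target", "indication", "phase")

def nl_to_query(question: str) -> dict:
    filters = {}
    for field in ALLOWED_FIELDS:
        match = re.search(rf"{field}\s*[:=]?\s*(\w+)", question.lower())
        if match:
            filters[field] = match.group(1)
    return filters

# Guardrail 2: constrain the generation step to the returned context only.
def constrained_prompt(context: str, question: str) -> str:
    return (
        "Using ONLY the context below, answer the question. "
        "If the context is insufficient, reply 'not found'.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(nl_to_query("phase 3 trials for target EGFR"))
# {'target': 'egfr', 'phase': '3'}
```

Both controls narrow where stochasticity can do damage: the retrieval query is deterministic and auditable, and the answer step is fenced inside the returned context.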
Success depends on what Mullen describes as a three-part foundation: “quality data, the right technology, and an efficient means of interaction.” This foundation helps to minimize the stochasticity of generative AI, enhancing reproducibility in scientific experimentation using RAG and ensuring trustworthy AI systems.
From a research perspective, RAG systems that scour the academic literature can also help researchers more clearly understand the parameters of similar studies, and thus design more rigorous experiments.
The rise of the research agent
The evolution of AI research tools is moving beyond basic information retrieval toward more sophisticated, autonomous capabilities, a shift Mullen refers to as “agentification.” This transition has significant implications for scientific research methodologies. It connects back to the earlier RAG implementation challenges of ensuring reliability and reproducibility, but gives the AI system more responsibility for the relevance and accuracy of the information it retrieves.
In clinical research applications, the potential of well-architected agentic RAG systems extends far beyond basic information retrieval. The complexity of drug discovery and clinical trial workflows demands sophisticated, multi-component orchestration.
“The information retrieval step could involve a combination of ontology-based and vector-based methods,” Mullen explains. “This means recognizing that questions can go to multiple sources, determining optimal routing pathways, ranking returned information, and executing complex process chains.”
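One way such routing and ranking might be orchestrated, with both retrieval routes stubbed out for illustration (the function names, scores, and documents are hypothetical; the coordination logic is the point):

```python
def ontology_retrieve(query: str) -> list[tuple[str, float]]:
    # Stub: exact concept matches, scored high.
    return [("doc-ontology-1", 0.9)]

def vector_retrieve(query: str) -> list[tuple[str, float]]:
    # Stub: semantic neighbors with similarity scores.
    return [("doc-vector-1", 0.8), ("doc-ontology-1", 0.7)]

def route_and_rank(query: str, k: int = 3) -> list[str]:
    scores: dict[str, float] = {}
    for retriever in (ontology_retrieve, vector_retrieve):
        for doc_id, score in retriever(query):
            # A document found by several routes keeps its best score.
            scores[doc_id] = max(scores.get(doc_id, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(route_and_rank("EGFR inhibitors in non-small cell lung cancer"))
# ['doc-ontology-1', 'doc-vector-1']
```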
Clinical trial optimization represents a prime use case for these advanced capabilities. “When it comes to recruitment, outcome prediction, and trial design optimization—that’s where you’re going to see a lot more application of AI in the future,” Mullen notes. These applications could benefit from sophisticated agentic RAG systems that go beyond simple query-response patterns, instead orchestrating multiple data streams and analytical processes in a coordinated, traceable manner.
Toward a symbiotic partnership
As AI architectures become more sophisticated, the focus will be on augmenting rather than replacing human researchers, who will be able to reimagine their roles. “It’s important that humans remain at the center of decision-making—the interpretation of data and validation of results,” Mullen emphasizes. “An agentic system doesn’t necessarily remove human involvement,” he explains. “We can think of it as a workflow where at every step, you could have human input: ‘This is the step I’ve reached, this is the data returned, and this is where the system is.’” The level of engagement varies by task complexity, from continuous oversight in research-intensive processes to lighter-touch validation for operational tasks.
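As a sketch, such a checkpointed workflow might look like the following, with illustrative step functions and an approval hook standing in for the human reviewer:

```python
# Each step reports where it is and what it returned; a reviewer decides
# whether the workflow proceeds. The steps and approval hook are illustrative.
def run_with_oversight(steps, approve):
    state: dict = {}
    for name, step in steps:
        state = step(state)
        print(f"Step '{name}' complete. Data returned: {state}")
        if approve(name, state) != "y":
            print("Halted by reviewer.")
            break
    return state

steps = [
    ("retrieve", lambda s: {**s, "docs": ["doc-1", "doc-2"]}),
    ("summarize", lambda s: {**s, "summary": "two relevant trials found"}),
]

# Auto-approve here for demonstration; in practice, `approve` would surface
# the step name and returned data to a human and wait for a decision.
final_state = run_with_oversight(steps, approve=lambda name, state: "y")
```

Swapping the auto-approve lambda for a real review gate is how the same workflow scales from light-touch validation to continuous oversight.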
Looking ahead, success in AI-assisted research will depend on developing systems that prioritize accuracy, trustworthiness, and security. As the “Insights 2024” survey indicates, scientists expect generative AI tools to be based on high-quality, trusted sources. Organizations that achieve this balance while maintaining human centrality in scientific decision-making will distinguish themselves in the field. “When undertaking complex tasks, traceability between input and output becomes critical,” Mullen said. The hype surrounding AI can also produce applications that are solutions in search of problems. “I think it’s really important to get your head around understanding what the problem is before you even start to think about what the solution is that you should be bringing to tackle that.”