In medical AI, the biggest names aren’t necessarily delivering the best results. While tech giants race to build ever-larger language models, a new preprint suggests that on clinical accuracy and physician trust, a smaller player is outperforming the industry heavyweights.
Putting large language model-based systems to the test
Researchers put five generative AI systems to the test, evaluating their ability to provide reliable, actionable medical advice. In the study, nine independent physicians rated the systems’ answers to 50 clinical questions for relevance, reliability, and actionability.
The widely known large language models (LLMs) – ChatGPT-4, Claude 3 Opus, and Gemini Pro 1.5 – struggled to deliver trustworthy answers, providing relevant and evidence-based responses to only 2% to 10% of the questions. These LLMs also frequently “hallucinated” citations, with 25% to 47% of their cited sources being either fictitious or entirely irrelevant to the question at hand.
A retrieval-augmented generation (RAG)-based system, OpenEvidence, performed better, delivering relevant and evidence-based answers to 24% of the questions. But ChatRWD, an AI-powered chat-to-database application from Atropos Health, fared best, with a 58% success rate.
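For context, a retrieval-augmented generation system first pulls passages from a vetted evidence corpus and then asks the model to answer only from those passages and cite them, which is why hallucinated references become less likely. The Python sketch below is a deliberately simplified, generic illustration of that pattern; the toy corpus, keyword-overlap retriever, and prompt wording are hypothetical placeholders and do not describe how OpenEvidence or ChatRWD actually work.

```python
# Generic sketch of the retrieval-augmented generation (RAG) pattern:
# retrieve trusted passages first, then constrain the model's answer to them.
# Corpus, scoring, and prompt format are illustrative placeholders only.

from dataclasses import dataclass

@dataclass
class Passage:
    source: str   # citation the answer can point back to
    text: str

# Hypothetical mini-corpus standing in for a curated evidence library.
CORPUS = [
    Passage("Smith 2022, J Clin Trials", "Drug A reduced 30-day readmissions vs. placebo."),
    Passage("Lee 2021, Cardiology Rev", "Beta-blockers lowered mortality in heart failure."),
    Passage("Patel 2023, Oncology Letters", "Regimen B improved progression-free survival."),
]

def retrieve(question: str, corpus: list, k: int = 2) -> list:
    """Rank passages by naive keyword overlap with the question
    (a stand-in for a real embedding- or index-based retriever)."""
    q_terms = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: len(q_terms & set(p.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(question: str, passages: list) -> str:
    """Assemble a prompt instructing the model to answer only from the
    retrieved evidence and to cite it, rather than free-associating."""
    evidence = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return (
        "Answer using ONLY the evidence below and cite the bracketed sources.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    q = "Do beta-blockers reduce mortality in heart failure?"
    prompt = build_grounded_prompt(q, retrieve(q, CORPUS))
    print(prompt)  # this grounded prompt would then be sent to an LLM of choice
```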
The need for trustworthy, clinical-grade generative AI
“I believe the whole generative AI space in healthcare is moving towards quality and trust, which is what we’ve been focusing on from the beginning,” said Dr. Brigham Hyde, CEO of Atropos Health, the developer of ChatRWD.
Many healthcare professionals have been exploring off-the-shelf large language models for roughly the past six months to a year, Hyde said. “And the hallucination problem is very real,” he added. “The goal is to find a way to use LLMs that offer convenience and speed while maintaining trust and accuracy, which is the holy grail for healthcare providers.”
Hyde noted that the version of ChatRWD used in the study is an early one. “We will not [formally] launch until our accuracy is in the 90-percent range,” he said. “And even then, we recommend that clinicians use it as a tool to inform their decisions.” He emphasized that clinicians using such technology should still tap experts to help contextualize the results.
The shift from convenience to trust
“I think what’s happening now is a shift from convenience to trust,” Hyde said. “As physicians, we’re being inundated with messages from these LLMs, and in our setting, where we’re producing evidence that could inform a treatment decision for a patient, we simply can’t afford a 20% error rate.”
Other companies are also working to develop more medically accurate AI systems. One example is Google’s Med-PaLM 2, which has shown promising results on medical exams and in answering consumer health questions. Other tech companies, such as IBM and Microsoft, have similar initiatives.
Atropos is unique in its focus on providing rapid, high-quality real-world evidence (RWE) to support clinical decision-making and research in healthcare.
“Even the clinical trials we do run often exclude a significant portion of patients – around 70% – who have comorbidities,” Dr. Hyde explained. “And guess what? That’s about 70% of the patients doctors see every day.”
ChatRWD aims to bridge this critical gap by providing clinicians with rapid access to real-world evidence. “Once they [clinicians] input their query, we return a new study in under three minutes,” Hyde explained. This stands in contrast to traditional methods, where such comparative effectiveness studies can take six to eight weeks to conduct and often require large teams and significant resources. “Now you’ve got an individual user, with no programming ability, no statistics ability, just asking the question and being led through these steps,” he added.