It’s been said that data is the new oil, something pharma researchers understand well. The phrase hits especially close to home for artificial intelligence (AI) researchers who work heavily with medical images. Medical imaging is a powerful diagnostic tool in the drug development process, as it can provide the efficacy and safety monitoring required in clinical trials for regulatory submission. And notably, medical images provide valuable insights into the drug mechanism of action and effects that can guide researchers during their design process.
Machine learning and AI researchers can leverage this rich medical imaging data to dramatically accelerate the drug discovery process. However, to realize this potential, AI models must be exposed to vast amounts of training data. Acquiring and curating imaging data in the volumes necessary to build accurate, unbiased models is a challenge even for organizations committed to this path of development.
In a perfect world, AI researchers would be able to easily train their models on broadly diverse and representative cohorts. But this isn’t yet the reality in the U.S. Findings published in the Journal of the American Medical Association in 2020 showed that many clinical deep learning algorithms in the U.S. have been “disproportionately trained on cohorts from California, Massachusetts and New York, with little to no representation from the remaining 47 states.”
This lack of diverse data creates a risk that bias will creep into an algorithm and affect subsequent steps in the drug discovery process. The problem is not only that the AI may underserve populations that aren’t represented in its training data, but also that researchers may miss valuable insights they could have gained from specific population segments, insights that might have improved their project’s overall success.
Federated approaches provide more access to relevant data
The good news is that an emerging AI training method — federated learning — promises to make it simpler for researchers to train models on high volumes of data from geographically, racially and socioeconomically diverse populations. Federated learning takes what has been a centralized process for AI training and decentralizes it—sending an algorithm to data rather than bringing data to an algorithm.
With federated learning, researchers can send their AI model to data held at external sites (universities or healthcare systems, for example) and simply collect the updated model weights once the model has been trained on the external data. As a result, the data never leaves its owner, which simplifies compliance with data privacy regulations, eliminates administrative steps and reduces data acquisition costs. This simplification also lends itself to iteration, giving researchers more opportunities to adjust and refine their models.
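For readers who want to see the round trip in concrete terms, here is a minimal Python sketch of one common aggregation scheme, federated averaging. It is an illustration under stated assumptions, not the method used by any particular platform: the function names (`local_train`, `federated_average`, `run_round`), the toy data and the tiny linear model standing in for an imaging model are all hypothetical, and weighting each site by its number of examples is one reasonable choice among several.

```python
from typing import List, Tuple

Weights = List[float]                # [slope, intercept] of a toy linear model
Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one site


def local_train(weights: Weights, data: Dataset,
                lr: float = 0.05, epochs: int = 20) -> Weights:
    """Runs at the data owner's site: gradient descent on y = w*x + b."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Runs at the researcher's server: average the returned weights,
    weighting each site by how many examples it trained on."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(wts[i] * n for wts, n in updates) / total for i in range(dim)]


def run_round(global_weights: Weights, sites: List[Dataset]) -> Weights:
    """One round: send the model out, train locally, aggregate the results.
    Only weights and example counts come back; raw data stays at each site."""
    updates = [(local_train(list(global_weights), data), len(data))
               for data in sites]
    return federated_average(updates)


# Hypothetical toy data: two sites whose records both follow y = 2x + 1.
site_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
site_b = [(3.0, 7.0), (4.0, 9.0)]

weights = [0.0, 0.0]
for _ in range(50):
    weights = run_round(weights, [site_a, site_b])
```

In this toy setup the shared model recovers a slope near 2 and an intercept near 1 even though neither site ever exposes its records. A production system would add secure transport, access controls and, in many cases, further privacy protections on the returned weights.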
Contrast this with the well-known traditional methods of acquiring data for AI discovery. Currently, pharma researchers rely on a handful of ways to get data: they might look to their own archives for historical data they could reuse for AI training, sponsor research with an academic or hospital partner or acquire data from a data vendor.
Two of these three methods are quite costly, with data vendors charging hundreds of dollars for a single patient’s MRI exams and sponsored research costs easily running into the millions (to say nothing of a sponsored study’s lengthy timeline). Of course, data privacy is an additional challenge when acquiring external data for research — adding bureaucracy, data preparation costs, and extra time into the equation.
More data is coming due to NIH mandate
It’s easy to see the extraordinary potential of federated learning to sidestep many headaches of data acquisition. The method is also gaining attention at the same time that the academic and clinical research communities are reckoning with longstanding data-sharing challenges. Next year, a new rule from the NIH will require its grant recipients to submit Data Management and Sharing Plans, an effort to address the scientific reproducibility crisis. The NIH is also promoting the use of data repositories that align with the Findable, Accessible, Interoperable and Reusable (FAIR) principles of scientific data.
With a requirement for NIH-funded scientists to make their data available and with an ongoing emphasis on FAIR data, researchers from both public and private entities will soon be in a far better position to locate and collaborate on datasets. Until now, locating data has been a perennial challenge, but researchers will be able to actually browse and search for datasets in a catalog-like format in the near future. This new availability of data, combined with the ease of collaboration and iteration enabled by federated learning, has the potential to foster many academic and commercial partnerships.
One such collaboration occurred recently between a pharmaceutical company and a university healthcare provider. Both parties used a common data platform to ingest and curate a large volume of chest X-ray data, which helped ensure uniform preparation. AI researchers at the pharma company created their training method and sent their model to external sites. These sites received the model from the server, trained it using their respective data and sent the updated model weights back to the pharma researchers.
This example is just a preliminary illustration of how federated learning might be leveraged to accelerate the pace of AI research. With more tools becoming available to discover potential collaborators and rich, diverse datasets, federated learning is poised to become an indispensable part of the next wave of AI innovations.
Jim Olson is CEO of Flywheel, a biomedical research informatics platform leveraging the power of cloud-scale computing infrastructure to address the increasing complexity of modern computational science and machine learning. Jim is a “builder” at his core. His passion is developing teams and growing companies. Jim has over 35 years of leadership experience in technology, digital product development, business strategy, high growth companies, and healthcare at both large and startup companies, including West Publishing, now Thomson Reuters, Iconoculture, Livio Health Group and Stella/Blue Cross Blue Shield of Minnesota.