How a data lakehouse can give you a panoramic view of your AI-enabled clinical trials 

By Brian Buntz | November 2, 2023

[Image: Lakehouse sunset. Outlier Artifacts/Adobe Stock]

In recent years, the term “data lakehouse” has entered the lexicon of data professionals. For AI-enabled clinical trials, the lakehouse architecture promises seamless integration of diverse data streams, ranging from patient health records to real-time sensor data, all processed efficiently and queryable in structured formats.

The lakehouse architecture aims to provide a comprehensive overview of data, ensuring both vast storage and real-time processing capabilities. In other words, the lakehouse offers the “best of both worlds” when it comes to data warehouses and data lakes, according to Venu Mallarapu, vice president of global strategy and operations at eClinical Solutions.

AI and ML move from buzzwords to practical tools in clinical trial management

As the use of AI and ML in clinical trials becomes more prevalent in patient recruitment, real-time data monitoring and beyond, the lakehouse architecture could provide a seamless and integrated data management option. That’s because it can bridge the gap between the structured querying capabilities and performance of a data warehouse and the scalable storage and flexibility of a data lake. Notably, the application of AI and ML in data management within the lakehouse architecture can significantly bolster clinical trial efficiency. In addition, the ability to source data directly from electronic health records (EHRs) and electronic medical records (EMRs) can bypass the need for traditional electronic data capture (EDC) systems, further streamlining the data management process.
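To make the idea concrete, here is a minimal sketch of warehouse-style SQL over lake-style open files, using DuckDB to query Parquet directly. This is a lightweight stand-in for the pattern; production lakehouses more often pair an engine such as Spark with a table format such as Delta Lake or Apache Iceberg. The file name, columns, and values below are invented for illustration.

```python
# Minimal sketch: structured SQL over open columnar files.
# Assumes duckdb and pyarrow are installed (pip install duckdb pyarrow).
# Table layout and values are hypothetical.
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Land some "raw" trial data in an open format, as a lake would store it.
visits = pa.table({
    "patient_id": ["P001", "P002", "P001"],
    "visit_date": ["2023-10-01", "2023-10-02", "2023-10-15"],
    "systolic_bp": [122, 141, 118],
})
pq.write_table(visits, "visits.parquet")

# Query the raw file directly with SQL -- warehouse-style access
# without a separate load step.
print(duckdb.sql("""
    SELECT patient_id, AVG(systolic_bp) AS mean_systolic
    FROM 'visits.parquet'
    GROUP BY patient_id
    ORDER BY patient_id
""").fetchall())
```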

In addition, wearable sensors, which are increasingly used in clinical trials, generate rich time-series data that a lakehouse environment can help manage and interpret. Despite the promise of the data lakehouse approach, adoption remains at an early stage. Migrating existing on-premises databases is one potential barrier, given the technical and change-management challenges that come with any new architecture. Another factor is the inertia of well-established traditional data warehouses, which have been foundational in enterprise data management for decades.
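On the wearable-sensor point above, a common way to keep high-volume time-series data queryable is to partition it on write. The hypothetical sketch below lands heart-rate readings in a date-partitioned Parquet layout with PyArrow, so a query for one day touches only that day's files; all names and values are invented.

```python
# Hypothetical sketch: date-partitioned storage for wearable readings.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

readings = pd.DataFrame({
    "patient_id": ["P001"] * 4,
    "ts": pd.to_datetime(["2023-10-01 08:00", "2023-10-01 08:01",
                          "2023-10-02 08:00", "2023-10-02 08:01"]),
    "heart_rate": [72, 75, 71, 98],
})
readings["day"] = readings["ts"].dt.strftime("%Y-%m-%d")

# Partitioning by day keeps each day's readings in its own directory.
pq.write_to_dataset(pa.Table.from_pandas(readings),
                    root_path="hr_lake", partition_cols=["day"])

# A time-window query only reads the matching partition.
dataset = ds.dataset("hr_lake", format="parquet", partitioning="hive")
one_day = dataset.to_table(filter=ds.field("day") == "2023-10-02")
print(one_day.to_pandas()[["patient_id", "heart_rate"]])
```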

Data warehouse: Traditional strengths and weaknesses

[Image: Venu Mallarapu]

Of the two, data warehouses have been around much longer. The concept dates back to the 1980s and early 1990s, when the computer scientist and prolific author Bill Inmon popularized it. The idea was to centralize data traditionally stored in silos, offering a horizontal view of organizational information, and the architecture has since cemented its status as a foundational component of enterprise data management. “That architecture is quite conducive for structured data and data that is in rows and tables,” Mallarapu said.

Traditionally, data warehouses have excelled at extracting data from transactional systems or systems of record, transforming it into a chosen format, and loading it into a repository, a process known as “ETL,” short for extract, transform, load.
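A toy version of that pipeline, with an in-memory SQLite database standing in for both the system of record and the warehouse, might look like the following; the table names, columns, and the out-of-range rule are all invented for illustration.

```python
# Hypothetical ETL sketch: extract from a transactional store,
# transform into an analysis-friendly shape, load into a reporting table.
import sqlite3

src = sqlite3.connect(":memory:")  # stands in for a system of record
src.execute("CREATE TABLE labs (patient_id TEXT, test TEXT, value TEXT)")
src.executemany("INSERT INTO labs VALUES (?, ?, ?)",
                [("P001", "glucose", "5.4"), ("P002", "glucose", "7.9")])

# Extract raw rows from the source system.
rows = src.execute("SELECT patient_id, test, value FROM labs").fetchall()

# Transform: cast text values to floats and flag out-of-range results.
transformed = [(pid, test, float(v), float(v) > 7.0) for pid, test, v in rows]

# Load into the "warehouse" table with the chosen schema.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE labs_fact "
           "(patient_id TEXT, test TEXT, value REAL, flagged INTEGER)")
wh.executemany("INSERT INTO labs_fact VALUES (?, ?, ?, ?)", transformed)
print(wh.execute("SELECT * FROM labs_fact WHERE flagged = 1").fetchall())
```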

At present, many organizations focused on clinical trials continue to rely on the data warehouse approach. But as the clinical trial landscape evolves, with growing reliance on decentralized clinical trials, omics data, and other advanced scientific methods, the approach is less well suited. “That’s where the need for the lakehouse is coming into play,” Mallarapu said.

The rise of data lakes

[Image: Conceptual illustration of a data lake as a centralized repository for structured and unstructured data at scale. ArtemisDiana/Adobe Stock]

The data lake first became popular after James Dixon, then the chief technology officer of the business intelligence firm Pentaho, coined the term in 2010 to refer to vast repositories of raw data. “Data lakes came along with the advent of the cloud and cheaper storage,” Mallarapu recalled.

Data lakes store large volumes of raw data, both structured and unstructured, in its natural state. Users can run traditional ETL processes on the structured portion while also extracting information from unstructured sources, such as text, images, audio, and video, that lack a predefined data model.
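That “schema on read” behavior can be sketched in a few lines: raw records land as-is, and structure is imposed only when the data is read for analysis. Everything here, from file names to the normalization rule, is hypothetical.

```python
# Toy illustration of schema on read in a data lake.
import json
from pathlib import Path

lake = Path("raw_zone")
lake.mkdir(exist_ok=True)

# Ingest: heterogeneous records land with no upfront schema enforcement.
(lake / "device_a.json").write_text(json.dumps(
    {"patient": "P001", "hr": 72, "ts": "2023-10-01T08:00:00"}))
(lake / "device_b.json").write_text(json.dumps(
    {"subject_id": "P002", "heart_rate": 75}))  # different field names

# Read: map each source's fields onto one schema only at query time.
def normalize(record: dict) -> dict:
    return {
        "patient_id": record.get("patient") or record.get("subject_id"),
        "heart_rate": record.get("hr") or record.get("heart_rate"),
        "ts": record.get("ts"),  # may be missing; the lake tolerates gaps
    }

print([normalize(json.loads(p.read_text())) for p in lake.glob("*.json")])
```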

Compared to data warehouses, data lakes offer considerably more flexibility: they can ingest data in real time from an array of sources without an immediate structure or schema. In recent years, pharma companies such as Bristol Myers Squibb (BMS), Takeda Pharmaceutical, and Amgen have implemented data lakes to boost the speed and efficiency of their research processes. But data lakes can be difficult to configure, and maintaining data integrity and quality takes ongoing work. Without proper governance, a data lake can degrade into a “data swamp.”

Enter the data lakehouse architecture: A powerful foundation for AI

As mentioned at the outset, a lakehouse is a hybrid approach that offers the best of both worlds. The environment is a good fit for data-hungry AI/ML projects, Mallarapu said. It offers a repository that can store structured, semi-structured, and unstructured data alike, and, in contrast to a data lake, it provides the structured querying and data management capabilities of a data warehouse.

For AI-enabled clinical trials, the lakehouse architecture unifies diverse data streams, from patient health records to real-time sensor data, with efficient processing and structured querying. By combining vast storage with real-time processing, it can meet data scientists’ needs directly. “You can draw the data that you need for AI training, testing, and validating the AI/ML models directly,” Mallarapu said.
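As a hedged sketch of that workflow, the snippet below generates a synthetic feature table (standing in for a lakehouse query), splits it, and fits a simple scikit-learn model. The dropout-prediction task and all column names are invented, not drawn from any particular platform.

```python
# Sketch: draw a feature set, split it, and train/validate a model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Stand-in for something like:
#   SELECT age, visit_adherence, dropped_out FROM trial_features
age = rng.normal(55, 12, n)
adherence = rng.uniform(0.5, 1.0, n)
dropped_out = (rng.random(n) < (1.05 - adherence)).astype(int)

X = np.column_stack([age, adherence])
X_train, X_test, y_train, y_test = train_test_split(
    X, dropped_out, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```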

In addition, lakehouse architecture has the potential to support the application of generative AI (gen AI), given that it can handle diverse data types and ensure data privacy.

As AI/ML tools emerge to address specific needs in clinical research, they promise to boost the efficiency and accuracy of various aspects of clinical trials, spanning patient recruitment, data analysis and decision-making. But implementing AI/ML in a highly regulated environment is not without challenges. One of the main hurdles is finding ways to tap such powerful tools while maintaining data privacy and complying with stringent regulations. The lakehouse architecture, with its robust security measures and compliance features, can help address these challenges by providing a secure and compliant environment for data processing and analysis.
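One control that commonly appears in such pipelines is pseudonymization of direct identifiers before data reaches analysts. The sketch below uses a keyed hash (HMAC) so the original IDs cannot be recovered without the key; the key handling shown is deliberately simplified, and a real deployment would pull the secret from a managed key store.

```python
# Illustrative pseudonymization of a direct identifier with a keyed hash.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; use a KMS in practice

def pseudonymize(patient_id: str) -> str:
    # HMAC-SHA256 keeps the mapping stable but unrecoverable without the key.
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "P001", "systolic_bp": 122}
safe_record = {**record, "patient_id": pseudonymize(record["patient_id"])}
print(safe_record)
```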

Burgeoning support for the data lakehouse architecture in clinical trials

eClinical Solutions uses the architecture in its platform, elluminate, to help pharma R&D professionals with decision-making in clinical trials. More than 100 life science organizations are using the platform, including Bristol Myers Squibb, bluebird bio, Jounce Therapeutics, Agios, and Urovant Sciences.

In the broader landscape, other companies have adopted the data lakehouse for clinical data management and analytics, such as Amgen and Verana Health.

The data lakehouse architecture supports real-time data monitoring, which matters for clinical trial data: it can surface anomalies that may indicate safety problems, allowing immediate action to address potential risks to patients. A data warehouse could potentially support real-time monitoring, but a lakehouse is more efficient here, given its ability to seamlessly integrate diverse data streams, handle vast volumes of structured and unstructured data, and provide both scalable storage and agile querying capabilities.
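As a rough illustration of that kind of monitoring, the sketch below applies a rolling z-score to a stream of heart-rate readings and flags outliers. The window size, threshold, and alert action are arbitrary illustrative choices, not a clinical algorithm.

```python
# Toy real-time anomaly flagging with a rolling z-score.
from collections import deque
from statistics import mean, stdev

def monitor(stream, window=20, z_threshold=3.0):
    history = deque(maxlen=window)
    for reading in stream:
        if len(history) >= 5 and stdev(history) > 0:
            z = (reading - mean(history)) / stdev(history)
            if abs(z) > z_threshold:
                yield reading, round(z, 1)  # flag for immediate review
        history.append(reading)

heart_rate = [72, 74, 71, 73, 75, 72, 74, 73, 71, 140, 72, 73]
for value, z in monitor(heart_rate):
    print(f"anomalous reading: {value} (z = {z})")
```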

Organizations using the architecture can also develop purpose-built applications or products while also furthering AI initiatives. Companies that succeed in tapping AI and ML to drive decisions, whether operational or scientific, will have “a huge advantage,” Mallarapu said. In this context, the data lakehouse architecture provides a solid foundation that “not only meets your needs today but is future-proof,” he concluded.


About The Author

Brian Buntz

As the pharma and biotech editor at WTWH Media, Brian has almost two decades of experience in B2B media, with a focus on healthcare and technology. While he has long maintained a keen interest in AI, more recently he has made data analysis a central focus, exploring tools ranging from NLP and clustering to predictive analytics.

Throughout his 18-year tenure, Brian has covered an array of life science topics, including clinical trials, medical devices, and drug discovery and development. Prior to WTWH, he was content director at Informa, where he focused on topics such as connected devices, cybersecurity, AI, and Industry 4.0. Before that, he spent a decade at UBM covering the medical device sector in depth. Engage with Brian on LinkedIn or drop him an email at bbuntz@wtwhmedia.com.
