An intelligent approach to data cleaning - Drug Discovery and Development


Collecting good-quality data from clinical trials is essential if data analysis is to produce robust results that meet regulatory requirements. Clinical trial data are increasingly complex because of involved protocols, geographically dispersed trial sites, a growing number of data streams and technological advances. Studies must therefore be set up to be efficient, to offer support and training to trial sites and to ensure that the right data are collected correctly.

Data management teams have historically reviewed data on an ongoing basis once clinical monitors have completed source document verification (SDV). This allows data issues to be actioned early in the trial and corrective action put in place. However, the review has been a very manual process: thorough but time-consuming, leaving less time for insightful data analysis.

Collaboration: Data science and data management

Data science and data management teams have begun successfully collaborating to explore ways in which data quality can be improved efficiently. Data science provides opportunities and techniques to highlight potential data issues. Data management applies its expertise to highlight the critical data and analyze the results of data science review techniques. Each approach is first implemented in small studies and integrated into the overall process once shown to be effective. The result is a multi-faceted implementation that applies data science techniques within data management to drive time and cost efficiencies while ensuring high-quality data.

Different approaches have been implemented to simplify the search space for data managers. For example, there may be thousands or even tens of thousands of data points in any given clinical trial, only a handful of which may be problematic.


Traditionally, data review would occur at the level of the individual data point to identify data issues (both through programmed electronic checks and manually) and raise individual queries to the site. An intelligent approach can reduce this search space. A data management expert will still need to review the data issues and inconsistencies but will be looking at a reduced field, which drives efficiency.

Smarter data science leading to efficiency

A variety of data science techniques are being explored in the context of data management to build usage and expertise and to drive efficiencies. These include the application of AI to deliver insights into the most problematic areas of data collection on a trial. For example, issues could relate to database design, a particular data source, local trends, or training and support requirements. A data management study expert interprets the insights to understand the context and act accordingly. Additionally, a rule-based approach has been successfully implemented to highlight complex inconsistencies. In a human-machine hybrid approach, this method narrows down the search space so that data reviewers can prioritize their efforts.
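As a rough illustration of how a rule-based approach can narrow the search space, the sketch below applies a small set of cross-field rules to subject records and returns only the records that break one. The field names, the two rules and the sample records are hypothetical and chosen for illustration; they are not taken from any real eCRF or from the tool described in this article.

```python
from datetime import date

# Hypothetical subject records; field names are illustrative only.
records = [
    {"subject": "S001", "sex": "F", "pregnancy_test": "negative",
     "ae_start": date(2021, 3, 1), "ae_end": date(2021, 3, 5)},
    {"subject": "S002", "sex": "M", "pregnancy_test": "negative",
     "ae_start": date(2021, 3, 10), "ae_end": date(2021, 3, 12)},
    {"subject": "S003", "sex": "F", "pregnancy_test": None,
     "ae_start": date(2021, 4, 2), "ae_end": date(2021, 3, 30)},
]

# Each rule is (description, predicate). The predicate returns True
# when the record is inconsistent and should be reviewed.
RULES = [
    ("AE end date before start date",
     lambda r: r["ae_end"] is not None and r["ae_end"] < r["ae_start"]),
    ("Pregnancy test recorded for male subject",
     lambda r: r["sex"] == "M" and r["pregnancy_test"] is not None),
]


def flag_records(records, rules):
    """Return (subject, rule description) pairs for every rule a record breaks."""
    return [
        (record["subject"], description)
        for record in records
        for description, is_inconsistent in rules
        if is_inconsistent(record)
    ]


flags = flag_records(records, RULES)
for subject, issue in flags:
    print(subject, issue)
```

A reviewer would then look only at the flagged subjects rather than at every record, which is the reduced field the hybrid approach aims for.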

Another key area of data science that has a large impact on the data manager’s role is the development and application of visual analytics, creating effective visualizations to monitor patterns in the data during the trial.

Data visualizations

Clear data visualizations can facilitate risk-based and centralized statistical monitoring.

This article concentrates on using data visualizations to support data management teams. Interactive visualizations enable teams to focus on specific risk areas. For example, they support risk-based and centralized statistical monitoring, that is, the remote evaluation and analysis of data generated during the clinical trial. This approach focuses on the data, highlighting outliers and deviations from the mean and identifying data errors and even potential fraud.
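One simple way to highlight deviations from the mean is a z-score check. The sketch below is a minimal example under stated assumptions: the systolic blood pressure values are made up, and the threshold of two standard deviations is an arbitrary illustrative choice, not a recommended tolerance limit.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=2.0):
    """Return the keys whose value lies more than `threshold` sample
    standard deviations away from the mean of all values."""
    data = list(values.values())
    if len(data) < 2:
        return []  # a standard deviation needs at least two values
    centre = mean(data)
    spread = stdev(data)
    if spread == 0:
        return []  # all values identical, nothing to flag
    return [key for key, value in values.items()
            if abs(value - centre) / spread > threshold]


# Hypothetical systolic blood pressure readings (mmHg) keyed by subject.
systolic_bp = {
    "S001": 118, "S002": 122, "S003": 120, "S004": 125, "S005": 119,
    "S006": 121, "S007": 123, "S008": 117, "S009": 180, "S010": 120,
}

outliers = flag_outliers(systolic_bp)
print(outliers)
```

With small samples a single extreme value inflates the standard deviation, so a study team may prefer a more robust measure such as the median and interquartile range; the z-score is used here only because it is the most direct reading of "deviation from the mean".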

Data quality is essential for the success of a clinical trial. Traditionally, 100% SDV was done at the site: source notes were checked against the eCRF data and, where there were differences, data queries were raised. Data management would then review the data after SDV to detect further issues requiring queries. However, this approach has caused challenges during the COVID pandemic, as monitors have been unable to visit sites.

Risk-based approach with ICH-E6-R2 regulation

Regulatory guidelines now support a different approach, with ICH-E6-R2 acting as a driver for a risk-based approach. The team's focus is on the high-risk items rather than on all data. This approach is a general framework whereby the study team identifies key risks up-front (high-risk data items and possibly high-risk sites) and then monitors for these risks while the study is ongoing. This helps to reduce the spread of data monitoring required and focuses effort on identifying, assessing and mitigating the risks that could impact the quality or safety of a study.

This is further supported by studies that have looked at query effectiveness. Reports suggest that while approximately 40% of manual queries raised led to a change in data, this amounted to less than 1% of the data in the database. Most queries therefore have no effect on the overall data. With each manual query estimated to cost up to $200 (U.S.), raising them is not the most efficient way to ensure a quality database.
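To put those figures in perspective, the short calculation below works through the arithmetic using the approximate 40% rate and the upper cost estimate of $200 per query quoted above. The study size of 1,000 manual queries is a hypothetical number chosen only to make the example concrete.

```python
COST_PER_QUERY_USD = 200   # upper estimate quoted in the text
EFFECTIVE_QUERY_PCT = 40   # approximate share of queries that changed data


def query_costs(n_queries,
                cost_per_query=COST_PER_QUERY_USD,
                effective_pct=EFFECTIVE_QUERY_PCT):
    """Summarize what a set of manual queries costs and how much of
    that spend goes on queries that do not change any data."""
    total_cost = n_queries * cost_per_query
    changing = n_queries * effective_pct // 100
    return {
        "total_cost": total_cost,
        "queries_changing_data": changing,
        "cost_of_queries_without_change": (n_queries - changing) * cost_per_query,
        "cost_per_data_change": total_cost / changing if changing else None,
    }


# Hypothetical study that raises 1,000 manual queries.
summary = query_costs(1000)
for name, value in summary.items():
    print(name, value)
```

Under these assumptions, each query that actually changes data effectively costs $500 once the spend on queries that change nothing is included.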

Once risks are identified, they can be monitored using different tools and platforms that apply rules and scoring approaches for tolerance limits. Examples include:

Study conduct: close monitoring of behavior across subjects, sites and countries that may trigger specific actions.
Safety data across trials: ensuring and improving patient safety, which is often a critical aspect.
Data integrity: identifying corrupt, erroneous or missing data.
Compliance: monitoring protocol deviations, for example.
Enrollment: ensuring that suitable patients are recruited to trials.

The approach allows the study team to concentrate on high-value tasks for their particular study.
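The idea of applying rules and scoring against tolerance limits can be sketched as follows. The indicator names, limits and site values are all hypothetical; in practice a study team would define them up-front as part of its risk assessment.

```python
# Hypothetical upper tolerance limits for three key risk indicators.
TOLERANCE_LIMITS = {
    "protocol_deviations_per_subject": 0.5,
    "missing_data_pct": 5.0,
    "open_queries_per_subject": 3.0,
}

# Hypothetical indicator values per site.
site_metrics = {
    "Site 01": {"protocol_deviations_per_subject": 0.2,
                "missing_data_pct": 1.5,
                "open_queries_per_subject": 1.0},
    "Site 02": {"protocol_deviations_per_subject": 0.9,
                "missing_data_pct": 7.2,
                "open_queries_per_subject": 2.5},
    "Site 03": {"protocol_deviations_per_subject": 0.4,
                "missing_data_pct": 6.0,
                "open_queries_per_subject": 1.2},
}


def breaches(metrics, limits):
    """Return the indicators whose value exceeds its tolerance limit."""
    return sorted(name for name, limit in limits.items()
                  if metrics.get(name, 0) > limit)


def rank_sites(site_metrics, limits):
    """Rank sites by how many tolerance limits they breach, worst first."""
    scored = {site: breaches(metrics, limits)
              for site, metrics in site_metrics.items()}
    return sorted(scored.items(), key=lambda item: len(item[1]), reverse=True)


ranking = rank_sites(site_metrics, TOLERANCE_LIMITS)
for site, breached in ranking:
    print(site, len(breached), breached)
```

Counting breaches is the simplest possible score; a weighted score per indicator would be a natural variation where some risks matter more than others.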

Visualization tools

In support of Risk-Based Monitoring activities, we have a visualization tool that was developed in a collaboration between data management and data science. To ensure success, data science needed to understand the critical requirements within each trial to enable the teams to monitor critical risks effectively and to understand utility across studies to remove duplicated effort.

The critical elements have centered on making the visualizations intuitive and valuable to the user. Accuracy and near-real-time data are essential to utility.

The tool uses three types of data for monitoring: clinical, metric and audit data.

Clinical data are trial data at the subject level and can include physical measurements such as vital signs or laboratory measures, adverse events, or efficacy measures.
Metric data provide insight into the data queries raised about data inconsistencies. They allow study teams to look at the number of queries and at site responsiveness, which can be useful for monitoring site activity.
Audit data are an overview of all data in the eCRF/clinical database up to that point in time. They provide dates of data entry and subsequent edits, with the current data reflecting the latest change. The audit provides a history of how each data point has changed over time, when and by whom. This is an honest representation of events and can be invaluable when looking at site and user behaviors. Not only can it help address problems with data input and areas of data collection (so mitigations can be put in place), but it also has the potential to detect possible fraudulent activity.
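To make the audit data description more concrete, the sketch below rebuilds the history of each data point from a list of audit entries, takes the latest value and counts how many edits each user made after the initial entry. The record layout, user names and values are hypothetical and do not reflect any particular EDC system's audit export.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical audit entries: one row per entry or edit of a data point.
audit_trail = [
    {"subject": "S001", "field": "weight_kg", "value": "27",
     "user": "nurse_a", "timestamp": datetime(2021, 3, 1, 9, 0)},
    {"subject": "S001", "field": "weight_kg", "value": "72",
     "user": "nurse_a", "timestamp": datetime(2021, 3, 2, 14, 30)},
    {"subject": "S002", "field": "weight_kg", "value": "80",
     "user": "nurse_b", "timestamp": datetime(2021, 3, 1, 10, 0)},
    {"subject": "S001", "field": "sbp", "value": "120",
     "user": "nurse_a", "timestamp": datetime(2021, 3, 1, 9, 2)},
    {"subject": "S001", "field": "sbp", "value": "125",
     "user": "coord_c", "timestamp": datetime(2021, 3, 20, 17, 0)},
    {"subject": "S001", "field": "sbp", "value": "130",
     "user": "coord_c", "timestamp": datetime(2021, 3, 20, 17, 1)},
]


def build_history(entries):
    """Group audit entries by data point, ordered from oldest to newest."""
    history = defaultdict(list)
    for entry in entries:
        history[(entry["subject"], entry["field"])].append(entry)
    for changes in history.values():
        changes.sort(key=lambda e: e["timestamp"])
    return dict(history)


def latest_values(history):
    """The current database value is the most recent audit entry."""
    return {point: changes[-1]["value"] for point, changes in history.items()}


def edits_per_user(history):
    """Count edits made after the initial entry of each data point."""
    counts = Counter()
    for changes in history.values():
        for change in changes[1:]:
            counts[change["user"]] += 1
    return counts


history = build_history(audit_trail)
current = latest_values(history)
edit_counts = edits_per_user(history)
print(current)
print(edit_counts.most_common())
```

Summaries like the per-user edit count are the kind of input a visualization of site and user behavior could draw on; what counts as an unusual pattern remains a judgment for the study team.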

Implementing a visualization tool across an increasing number of studies enables the team to build a standard library of visualizations. These are pre-built to allow teams to monitor key risk areas common to many studies, such as adverse events and laboratory data. These off-the-shelf visualizations are the core parts of the tool.

Using all three elements (clinical, metric and audit data, as described) is essential when monitoring a clinical trial. Together, they enable teams to monitor different aspects of the data collection process.

Interactive and customized visualization and integration

Visualizations are interactive, enabling stakeholders to drill down into specific areas of interest. This can be extremely valuable as the amount of data generated on a trial increases significantly. Traditionally, these data were viewed as static plots. While useful, static plots required teams to follow up in large data tables to locate the points of interest and understand the context.

The data science team has also invested in building bespoke visualizations on different types of data or monitoring risks specific to a given study. This involves working closely with the study team experts to understand requirements.

Different types of data can be integrated where appropriate. For example, Electronic Patient-Reported Outcome data (ePRO) brings challenges to teams because of both the volume of data generated and monitoring items such as subject burden (how the subjects use digital tools and the possible impact on them).

Clinical data visualizations that focus on adverse events can identify under- and over-reporting. In addition, laboratory data visualizations covering the different measures can identify outliers, particularly at the site level, and changes from baseline.
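A simple numerical counterpart to an adverse event reporting plot is to compare each site's adverse event rate per subject with the study-wide rate. The sketch below is illustrative only: the site counts are invented, and the ratio cut-offs of 0.5 and 2.0 are assumed values rather than established thresholds.

```python
def ae_reporting_flags(site_counts, low_ratio=0.5, high_ratio=2.0):
    """Flag sites whose adverse events per subject fall well below or
    well above the study-wide rate.

    site_counts maps site -> (number of subjects, number of adverse events).
    """
    total_subjects = sum(n for n, _ in site_counts.values())
    total_events = sum(ae for _, ae in site_counts.values())
    if total_subjects == 0 or total_events == 0:
        return {}  # no study-wide rate to compare against
    study_rate = total_events / total_subjects

    flags = {}
    for site, (n_subjects, n_events) in site_counts.items():
        if n_subjects == 0:
            continue  # site has not enrolled anyone yet
        ratio = (n_events / n_subjects) / study_rate
        if ratio < low_ratio:
            flags[site] = "possible under-reporting"
        elif ratio > high_ratio:
            flags[site] = "possible over-reporting"
    return flags


# Hypothetical (subjects, adverse events) per site.
site_counts = {
    "Site A": (10, 30),
    "Site B": (12, 4),
    "Site C": (8, 26),
    "Site D": (5, 40),
}

ae_flags = ae_reporting_flags(site_counts)
print(ae_flags)
```

A flag here is only a prompt for review: differences in patient population or time on study can explain a high or low rate without any reporting problem.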

Different data types can be brought together. For example, clinical events with respect to data entry can help identify problematic site behaviors. In addition, site summaries provide a high-level view of the site performance against specified risk items.

As part of the risk-based monitoring process, the review frequency is planned up-front and drives how often the visualizations are used. Interactivity adds considerable value over the listing and chart reviews that are traditionally used. It allows the user to zoom in, which is especially useful when plots are busy with many data points. Interactivity also enables the selection of data points in the plot. This can help highlight specific issues, allowing the user to identify anomalies or patterns in the data that may be pertinent to the risks they are monitoring.

Practical usage

Data visualization interactivity is enhanced with tooltips. The user can hover the mouse cursor over a particular data point or bar in a chart to reveal richer information.

These visualizations support both risk-based and centralized statistical monitoring and are valuable to the data management process for study teams.

Historically we have seen study teams rely on static tables for insight generation and monitoring during a clinical trial. While useful, these can be time-consuming and potentially error-prone to process, as teams manually create Excel files and summarize them in charts.

Interactive visualizations in dashboards can provide valuable tools for teams, making the monitoring process more efficient overall. They enable real-time monitoring and allow the team to focus its efforts on the high-risk data and sites.

Further work will include the development of algorithms for richer insights and alerts for users relative to specific thresholds. Sophisticated visualizations will support a deeper drill down into the data aligned with particular insights, bringing further benefits to the users.

Summary

In summary, using an intelligent approach to data cleaning delivers the primary aim of driving efficiency in the overall process, with multiple activities ongoing. Key to this is ensuring effective, close collaboration between data management and data science teams. Data management brings its specialized expertise and data science its technical expertise, with the teams working together to review impact and usability in evolving cycles.

Helen Smith is the senior data coordinator at Phastar.
