All too often, drug discovery organizations rely on ‘experts’ to make decisions while in many cases the data is readily available for anyone to use—if it is viewed and accessed in the right way.
A drug discovery organization’s knowledge is primarily derived from the internal data it generates. When that data is used to direct screening, decisions such as which compounds to test in primary screens rest on the quality of the data. If data quality is high, the knowledge derived from screening can be used to maximize the ability to find better leads. For example, previously observed activities of similar compounds or families against other targets can be used to assess potential selectivity issues in a lead more accurately.
The ultimate goal of knowledge-based screening— the process of extracting maximum knowledge from an organization’s data reserves and employing that know-how to support and improve research and business decisions—is to create virtual ‘experts’. Too often organizations rely on ‘experts’ to make decisions while in many cases the data is readily available for anyone to use—if viewed and accessed in the right way.
Hidden reservoirs of knowledge
To use this ‘hidden’ knowledge, it needs to be captured, collected, analyzed, progressed through QA and QC processes, and stored in a way that is accessible and preserves context. There are a variety of ways in which companies fail to manage this mountain of unused data sufficiently and so are unable to exploit it. Some organizations delete screening data once a project is finished; if the data is retained, it may only be in summary form, with limited usefulness in terms of preserving context and ensuring data quality. Accessing data can also pose problems, especially if it is stored in disparate data sources, warehouses, and silos across an organization. A consistent and coherent method of managing available data can eliminate these issues, allowing maximum value to be extracted from research information.
The value of data management
The need for an effective way to capture, analyze, QC, and report on every aspect of an experiment has grown significantly in the discovery industry in the last decade as a response to the increasingly large volumes of screening data generated by centralized screening, robotics and automation.
As data volumes increase, effective data management becomes ever more essential to the screening process, so that data is stored and processed in a way the organization can actually make use of.
Today’s data solutions
Today’s data management solutions have evolved to accommodate an integrated approach to data capture, storage, and analysis. Biological and chemical data, both factual and contextual, is stored in a central database. The use of robotics and automation has increased, and infrastructure and hardware are greatly improved. Automated data capture direct from laboratory instruments maintains the integrity of raw data, reduces transcription errors, and provides 24/7 screening and exception handling.
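To make the idea of automated capture concrete, the minimal Python sketch below parses a hypothetical plate-reader export file and loads the raw values, unchanged, into a central results table. The file layout, table name, and column names are assumptions for illustration, not a description of any specific instrument or product.

# Minimal sketch (illustrative assumptions only): parse a plate-reader export
# file and store the raw values exactly as produced, preserving the raw data.
import csv
import sqlite3

def capture_plate_file(path, db="screening.db"):
    con = sqlite3.connect(db)
    con.execute("""CREATE TABLE IF NOT EXISTS raw_reads
                   (plate_id TEXT, well TEXT, signal REAL, source_file TEXT)""")
    with open(path, newline="") as fh:
        # Assumed export columns: plate_id, well, signal
        for row in csv.DictReader(fh):
            con.execute("INSERT INTO raw_reads VALUES (?, ?, ?, ?)",
                        (row["plate_id"], row["well"], float(row["signal"]), path))
    con.commit()
    con.close()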
Sophisticated data analysis software is now included in some modern data management solutions, enabling scientists to perform curve fitting and statistical calculations within the same environment as data capture. Quality control functionality provides data visualization and configurable business rules that can flag potential errors and automatically knock out erroneous results, so that only suspect data is brought to the screener’s attention.
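As an illustration of this kind of in-environment analysis and rule-based QC, the Python sketch below fits a standard four-parameter logistic dose-response curve and applies a plate-level Z'-factor check. The example data, control values, and 0.5 threshold are illustrative assumptions rather than any vendor’s defaults.

# Minimal sketch: 4PL dose-response fitting plus a simple plate QC rule.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(conc, bottom, top, ic50, hill):
    """Standard four-parameter logistic (4PL) dose-response model."""
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill)

def fit_dose_response(conc, response):
    """Fit the 4PL model and return the fitted parameters."""
    p0 = [response.min(), response.max(), np.median(conc), 1.0]  # rough start
    params, _ = curve_fit(four_pl, conc, response, p0=p0, maxfev=10000)
    return dict(zip(["bottom", "top", "ic50", "hill"], params))

def z_prime(pos_ctrl, neg_ctrl):
    """Z'-factor: a common plate-quality statistic (>0.5 is usually acceptable)."""
    return 1.0 - 3.0 * (np.std(pos_ctrl) + np.std(neg_ctrl)) / abs(
        np.mean(pos_ctrl) - np.mean(neg_ctrl))

# Example run on synthetic data; the 0.5 cut-off is an assumed business rule.
conc = np.array([0.001, 0.01, 0.1, 1.0, 10.0, 100.0])
response = np.array([98.0, 95.0, 80.0, 45.0, 12.0, 5.0])
print(fit_dose_response(conc, response))

pos, neg = np.array([95.0, 97.0, 96.0]), np.array([4.0, 6.0, 5.0])
if z_prime(pos, neg) < 0.5:
    print("Plate flagged for screener review")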
Today, screening has adopted a production-line approach, taking advantage of both Lean and Six Sigma methodologies to address workflow efficiency and improve predictability (cycle time). Screening organizations focus on low unit cost, a high standard of data quality, and effective lead profile definition. Rather than being regarded as a way to find a drug instantaneously, the screening process identifies a lead series of potent molecules that conform to a predefined drug profile with the required lead candidate properties. Once identified, the lead series molecules are manipulated to promote selectivity, so that the drug is active on the desired target receptor but not on others, thereby avoiding side effects.
For example, a molecule may show efficacy on a target receptor but may need manipulating to change an aspect of its behavior, such as the ability to dissolve, or a need to control a side effect, without affecting its potency. Looking at past experimental data, scientists can gain an insight into the potential behavior of a lead, re-using available data to avoid unnecessary research effort and help to identify potential successes.
Fig 1: Typical lead generation process
Screening, as part of this lead generation process, is a multi-group discipline that relies on several separate systems and tools combined as point solutions. Integrating this data often involves retrieval from a number of data silos.
Silos of data
First generation integration solutions that centered on the concept of local repositories were unscalable and costly to maintain, and therefore had limited applicability. Organizations lacked a coherent and efficient way to access, correlate and integrate information that was scattered in separate and remote data silos such as databases, data marts and warehouses and sought a single point or ‘portal’ from which to access and search all available data sources.
For example, Fig 2 shows a typical ‘unmanaged’ workflow where several databases or silos are involved in the screening process.
Fig 2: Typical workflow involving several data silos
Storing data in silos hinders access to and re-use of an organization’s data and knowledge. It fosters work in isolation rather than cross-departmental communication, and it forces information to be transferred through a number of different systems and back again in order to complete a sequential process, as shown in Fig 2:
1. Using separate applications and databases, a chemist selects or designs compound libraries relevant to a particular target for screening.
2. A file is sent to a separate compound store database to physically generate the plate.
3. The plate is screened and the results are sent to a screening database.
4. The screening data is retrieved by chemists, who use a variety of applications to analyze the structures and store the data in a structure analysis database.
5. The structural analysis is returned to the library design database, where the compounds are modified based on the analysis or to conform more fully to the drug profile. A new library of compounds is created and again sent to the compound store database to build a new plate.
6. Screening is performed on the modified compound library and the results are sent to the screening database.
7. If necessary, further structural analysis is performed, saved in the structure analysis database, and returned to the library design database, until the desired drug is created.
As data passes through these fragmented sources, information can lose context and links to associated data, making its quality unreliable. ETL (Extract, Transform and Load) tools, employed when data is imported into each silo, apply cleaning rules that may be inconsistent from silo to silo, meaning data can be lost and become irretrievable. Archiving rules may also lack consistency if data is removed or modified upstream.
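The sketch below illustrates this inconsistency problem in miniature: two hypothetical silos apply different, entirely assumed cleaning rules on import, so the same result survives one load and is silently dropped from the other.

# Minimal sketch of inconsistent silo-level ETL rules (thresholds are invented).
def clean_for_screening_silo(results):
    # The screening silo keeps anything with a numeric signal.
    return [r for r in results if r.get("signal") is not None]

def clean_for_analysis_silo(results):
    # The analysis silo also drops results below an arbitrary signal cut-off,
    # losing low responders that may still carry useful context.
    return [r for r in results if r.get("signal") is not None and r["signal"] >= 10.0]

results = [{"compound": "CPD-1", "signal": 85.0},
           {"compound": "CPD-2", "signal": 3.5},
           {"compound": "CPD-3", "signal": None}]

print(len(clean_for_screening_silo(results)))  # 2 results retained
print(len(clean_for_analysis_silo(results)))   # 1 result retained; CPD-2 is gone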
Evolving data management
Data management has gradually evolved from these disparate data silos to a more centralized system in which data is stored in, and accessible from, a single location. Fig 3 shows an example of a data management system built around a central results database. Information from supporting databases is integrated into one searchable location, significantly improving data access, querying, and analysis.
Fig 3: One single point for data access and querying
Data management infrastructures like the one shown above are now surmounting obstacles to knowledge-based screening by offering a better level of integration, with components that can also fit into existing information architectures in a multi-vendor environment. These solutions allow storage of both biological and chemical data centrally, allowing easy access to and querying of a compound’s data from screening to candidate submission. Multiple groups can share and use current and previous unambiguous project data with full experimental context, streamlining workflows and promoting communication and collaboration across an organization.
This approach has similarities to the dimensional modeling used in data warehousing. The results database becomes a ‘single point of truth’, with supporting databases integrated into one central location via connecting applications. This ‘single point of truth’ can be integrated with similar central databases or used to populate marts and warehouses, enabling the use of service-oriented architecture (SOA) processes and data federation tools to integrate data with external applications and sources. A centralized data management system offers a host of opportunities for the exploitation of data. Organizations have access to their own content database, which in some instances may contain more than two billion results. ETL tools with flexible, open business rules can be employed to ensure quality and contextual richness, while analysis and curve-fitting tools apply consistency to results, so that like is compared with like.
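As a simple illustration of what the ‘single point of truth’ makes possible, the sketch below assumes a much-simplified central schema of compounds, assays, and results, and shows how one query returns a compound’s activity history across targets and projects. The schema and data are invented for the example.

# Minimal sketch: one query across an (assumed) unified results schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE compounds (compound_id TEXT PRIMARY KEY, smiles TEXT);
CREATE TABLE assays    (assay_id TEXT PRIMARY KEY, target TEXT, project TEXT);
CREATE TABLE results   (compound_id TEXT, assay_id TEXT, ic50_um REAL);
INSERT INTO compounds VALUES ('CPD-1', 'c1ccccc1CCN');
INSERT INTO assays VALUES ('A-1', 'Kinase X', 'Project Alpha'),
                          ('A-2', 'Kinase Y', 'Project Beta');
INSERT INTO results VALUES ('CPD-1', 'A-1', 0.12), ('CPD-1', 'A-2', 8.5);
""")

# One question, one query: where has this compound been active before?
for row in con.execute("""
        SELECT a.target, a.project, r.ic50_um
        FROM results r JOIN assays a ON a.assay_id = r.assay_id
        WHERE r.compound_id = ?""", ("CPD-1",)):
    print(row)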
Providing a potential database for building predictive models, a unified data management system allows research information to be re-used for decision support functions. Such a platform could accommodate predictive technology that alerts organizations to potential future issues with compounds and enable further chemistry analysis that helps to develop a lead series or analyze trends. Applying trend analysis algorithms widely used in other industries, such as SVM, MLR, PLS, PCA, Random Forest, and Consensus, helps to detect patterns and extract knowledge from a large amount of centrally stored data.
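A minimal sketch of the idea, assuming descriptors and observed activities have already been pulled from the central results database, is shown below: a Random Forest (one of the methods listed above) is trained on historical results and used to rank an untested library. The descriptors and data are synthetic placeholders, not a recommended model.

# Minimal sketch: trend analysis with a Random Forest on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_history = rng.normal(size=(200, 4))          # e.g. MW, logP, HBD, HBA descriptors
y_history = (X_history[:, 1] > 0).astype(int)  # stand-in for observed active/inactive

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_history, y_history)

X_new = rng.normal(size=(5, 4))                # descriptors for an untested library
print(model.predict_proba(X_new)[:, 1])        # predicted probability of activity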
Most knowledge is derived from people knowing what they are looking for, such as queries submitted to Internet search engines or selected content delivered via Web-based RSS (Really Simple Syndication) feeds. Efficient knowledge-based screening allows data to be used proactively: for example, searching for and identifying past failures and successes in studies and using that knowledge to drive current research and avoid similar costly failures.
By efficiently handling significantly increased screening throughput and encouraging data sharing across the whole discovery research community, knowledge-based screening boosts productivity and streamlines workflow. Research data can be exploited to the full, retaining its value over time and enabling re-use for data mining, predictive technology and decision support. Providing a base for informed and intelligent insights, data management enables organizations to anticipate and avoid future problems more efficiently.
About the Author
Glyn joined IDBS in 1995. IDBS is a leading software company specializing in integrated biological and chemical data management for discovery research. With over 10 years’ experience in drug discovery IT, Glyn has extensive expertise including project and product management. At IDBS he has worked primarily on large projects with many of the major pharmaceutical and biotechnology companies. Prior to IDBS, he worked in chemistry, including at Shell Exploration and Production (Shell EP). Glyn has a BSc Hons in Combined Studies from Manchester University and an MSc in Applied Computing.