Biology is a rapidly evolving science. Every new discovery uncovers new layers of complexity that must be unraveled to understand the underlying biological mechanisms of disease and to develop drugs successfully.
Driven both by community need and by the greater likelihood of a positive return on the large investment required, drug discovery research has often focused on identifying and understanding common diseases with relatively straightforward causes that affect large numbers of individuals. Today, companies continue to push the boundaries of innovation to alleviate the debilitating effects of complex diseases: those that affect smaller patient populations or vary widely from patient to patient. This requires looking deeper into the available data.
The big data revolution
Understanding complex and variable diseases requires examining data from large numbers of affected patients. More than 90% of the world's data has been created in the past two years, and the pace is accelerating. High-throughput technologies generate ever-expanding quantities of data for researchers to mine. But addressing one problem has created another: how can researchers find the specific information they need among the mass of data?
Beyond the simple issue of scale, data diversity also plays a key role. Twenty years ago, before the draft human genome sequence was finished, researchers could get a paper accepted by a journal simply by determining the sequence of a single gene. As knowledge has grown, successful research now depends less on the sequence itself and more on understanding the biological complexity that arises from vast networks of interactions among genes, proteins and small molecules. In this environment, how can researchers determine which information matters most for understanding a particular disease?
Finding the right data
With approximately one million scientific articles published annually, scientists face a daunting task in finding the papers relevant to their work. They are drowning in a data deluge, and even highly specific queries return hundreds of possible answers. Twenty years ago, researchers could feel fairly confident of keeping up with the most important discoveries by reading a handful of journals. Today, important, high-quality research appears in an ever-expanding collection of journals; recent estimates based on Google Scholar data suggest that as many as 42% of highly cited papers appear in journals that are not traditionally highly cited. Researchers must therefore cast a wide net to ensure they don't miss key discoveries. How can they be confident they have identified the most current and relevant research without missing a critical piece of the puzzle?
Although researchers often begin learning about a new disease with general search tools such as PubMed or Google Scholar, more specialized tools and approaches that can connect information from multiple sources are needed to filter the massive lists of possible results down to a manageable, relevant set. For instance, Elsevier offers research tools such as Reaxys for chemistry and Pathway Studio for biology. These solutions draw on journals and articles from Elsevier as well as from other publishers, and each provides focused search tools, so researchers can combine multiple data sources to build a comprehensive, detailed picture of their disease from the relevant data.
A “Big” project
DARPA's "Big Mechanism" project has tasked teams from leading universities and data providers with improving the discoverability of scientific data. Elsevier is contributing to one part of this project: developing "deep-reading" algorithms in conjunction with Carnegie Mellon University to uncover nearly all of the relevant data in a scientific publication. Understanding the role of KRAS activation in cancer was chosen as a test case because of its complexity: KRAS goes by at least five synonyms in the literature and interacts with more than 150 other proteins, many of which have dozens to hundreds of synonyms of their own. Once developed, these "deep-reading" tools can be extended to a wide range of other genes, proteins and diseases.
Developing effective discovery tools requires significant scientific expertise to ensure that data is categorized correctly so that computers can "read" and extract the relevant information. In the KRAS example, unless the data is categorized correctly, a researcher could end up needing to enter more than 500 search terms. In short, discovery tools need extensive, refined taxonomies to be of value. A combination of deep biological domain knowledge and sophisticated software development skill is needed to build computer-based "deep-reading" tools that match human accuracy while retaining the computer's speed advantage in sifting through massive data collections. A minimal sketch of how such a taxonomy reduces the manual burden follows below.
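To make the idea concrete, the sketch below expands a gene query against a small synonym taxonomy. This is an illustration only, not the actual Elsevier or Carnegie Mellon pipeline; the GENE_SYNONYMS dictionary and expand_query() helper are hypothetical, and the synonym lists are abbreviated examples of what a production taxonomy with hundreds of terms per gene would contain.

```python
# Minimal sketch: expanding a gene query with a small synonym taxonomy.
# GENE_SYNONYMS and expand_query() are hypothetical names used for illustration.

GENE_SYNONYMS = {
    # Illustrative entries only; real taxonomies hold many more synonyms.
    "KRAS": ["KRAS", "KRAS2", "KI-RAS", "c-K-ras", "K-Ras 2"],
    "TP53": ["TP53", "p53", "LFS1"],
}

def expand_query(genes):
    """Build a boolean OR query covering every known synonym of each gene."""
    terms = []
    for gene in genes:
        terms.extend(GENE_SYNONYMS.get(gene, [gene]))
    return " OR ".join(f'"{t}"' for t in terms)

if __name__ == "__main__":
    # A query for KRAS plus a handful of its interaction partners quickly
    # multiplies into hundreds of terms once each partner's synonyms are added.
    query = expand_query(["KRAS", "TP53"])
    print(query)
```

Without such a taxonomy behind the search tool, every one of those synonyms would have to be typed by hand, which is how a single-gene question balloons into the 500-plus search terms mentioned above.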
The way we work
Understanding the way scientists work is essential to developing tools that meet their unmet data-management needs. In addition to searching a diverse collection of external data sources, researchers often have proprietary research data of their own that must be integrated with other sources to provide the most complete picture. These tools must help researchers identify the most relevant data for the task at hand.
Since humans are very good at recognizing patterns visually, information should be presented in a form that lets users see those patterns. Tools that offer different views of the data can help users connect the dots and draw their own conclusions. It's the difference between trying to read a long list of subway stations in a foreign language and viewing a graphical map of the subway, as the small sketch below illustrates.
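As one illustration of that "map, not list" idea, the sketch below renders a handful of made-up protein-interaction pairs as a network diagram using the open-source networkx and matplotlib libraries. It is not a depiction of any particular Elsevier tool, and the interaction data is purely illustrative.

```python
# Sketch: showing interaction pairs as a graph rather than a flat list.
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical interaction pairs a discovery tool might return.
interactions = [("KRAS", "RAF1"), ("RAF1", "MAP2K1"),
                ("MAP2K1", "MAPK1"), ("KRAS", "PIK3CA")]

G = nx.Graph()
G.add_edges_from(interactions)

# Draw the network so hubs and clusters are visible at a glance.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=1200)
plt.savefig("interaction_map.png")
```

The same four rows of data read as an undifferentiated list in a table, but drawn as a graph the hub role of KRAS is obvious immediately.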
The research challenge
Searching these diverse collections of data to discover actionable insights into the biology of a disease is a huge challenge. The growth of data is outpacing our ability to analyze it, so new, more sophisticated tools and approaches are needed to help researchers connect the dots, no matter where the information is located. With the right discovery support, organizations can facilitate researchers' interpretation of experimental data, leading to greater insight into the mechanisms of disease and accelerating biological research. This, in turn, will help them invent, validate and commercialize new, clinically effective treatments faster and more efficiently.