
Nearly two thousand years ago, the eruption of Mount Vesuvius buried and carbonized a library of scrolls at Herculaneum. Now thousands of volunteers are working to “unroll” those scrolls virtually, using X-rays and machine learning to decipher layers of papyrus and ink. Eventually, we may learn what these scrolls say. It’s a marvelous use of AI, with lessons not just for archaeologists, but also for biopharma. After all: data storage is data storage.
Like archaeologists, biopharma companies carefully store old data. Originally, that data was often kept to prove IP claims or meet regulatory requirements. But now that large language models can parse data sets far larger than any human could, the hope is that archival data can also yield new meaning.
When the right tools arrive, old data can save time and add value. In silico investigations are often cheaper than wet science, and predictions built on treasure troves of old data can help companies be more selective about which in vivo or in vitro experiments they take on.
But currently, many data science departments spend a large portion of their time just making data usable. Of course, some data teams simply discard data they can’t currently use. But just as volunteers are now painstakingly reassembling ancient scrolls, other data teams painstakingly clean, organize, and compile. Some companies are ahead of the curve: one client we worked with saved everything in multiple open formats to avoid corrupted files — the modern-day version of a crumbling scroll.
But future-proofing file formats is just the beginning. Whether storing experimental data in file boxes or in the cloud, many biopharma companies dump data in storage and think “we will sort it all out later.” Some day. In the future. “When the AI gets better.” But for AI to contribute something meaningful, and for in silico experiments to succeed, it’s not enough for data to be intact. It also needs to be complete and to include the right context.
Deciphering ancient scrolls is a technical marvel. But as a source of knowledge, those scrolls may still be a letdown. We may wind up with a scroll containing the first half of a story, but not the resolution—an unresolved two-thousand-year cliffhanger. Or we may find a letter that changes everything we thought we knew about the ancient world—with no way to tell whether the ideas it contains were mainstream or totally fringe.
“Data ready” companies make sure their data captures the full story — even if they aren’t ready to use it yet. To get your scrolls in order: capture enough data, capture data in a consistent format, and invest in both company-wide norms for data stewardship and tools to help staff follow through on those goals.
Recommendations for ensuring your data is AI-ready
First, capture enough data. In the biopharma context, machine learning is already adding real value by comparing across large data sets, spotting patterns and trends, and making predictions. But those inferences are only as good as the data they start with. If your data lacks content and context, analysis may yield incomplete or misleading results.
1. Capture negative and null results
Most importantly, capture negative and null results. One client I visited was deep into an experiment that they could tell was going nowhere. So, they got lazy: they started capturing only their success data. They reasoned that since they planned to pull the plug anyway, the null results were not important.
Although humans may not readily recognize failure data as valuable, it is crucial to machine learning. AI can’t read minds; it makes educated guesses based on what it knows. If a scientist just keeps a mental note of what didn’t work and shares that information informally, a machine will miss the memo.
You don’t have to keep running an experiment that is clearly not working. But you should document your failures as carefully as you would your successes. Leaving out failure data can introduce biases and create misleading predictions. It also means missed opportunities, since machine learning can connect dots that humans would miss.
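As a sketch of what that documentation can look like in practice (the field and value names here are illustrative, not a prescribed schema), a failed run can be archived with exactly the same structure as a successful one:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ExperimentResult:
    experiment_id: str
    outcome: str                     # "success", "negative", or "null" -- always recorded
    measured_value: Optional[float]  # None is a recorded observation, not a gap
    notes: str = ""
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# The abandoned run is written to the archive exactly like a success would be,
# so a future model sees what didn't work, not just what did.
failed_run = ExperimentResult(
    experiment_id="EXP-0042",
    outcome="null",
    measured_value=None,
    notes="No effect detected at any tested concentration; run terminated early.",
)
```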
2. Capture all details
For the same reason, when recording experiments, don’t just track the conditions you need for that experiment. Track anything that might help put the experiment into context in the future. Even if something doesn’t seem relevant now, it may become so later as LLMs connect the dots across experiments and teams.
Capture everything you can. Capture which instrument you used. Capture which sample batch you used, the lineage of your samples, and details of the experimental conditions. For legal purposes, capture any restrictions on the data, like where it needs to be stored or whether it can be used for some purposes but not others.
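One minimal way to attach that context to every result, sketched in Python (these field names are hypothetical and would need to be mapped onto your own data model):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentContext:
    instrument_id: str             # which instrument produced the reading
    sample_batch: str              # batch identifier
    sample_lineage: list[str]      # parent batches, most recent first
    conditions: dict[str, str]     # temperature, buffer, operator, and so on
    restrictions: list[str] = field(default_factory=list)  # e.g. storage region, permitted uses

context = ExperimentContext(
    instrument_id="HPLC-07",
    sample_batch="B-2024-114",
    sample_lineage=["B-2024-101", "B-2024-089"],
    conditions={"temperature_c": "25", "buffer": "PBS pH 7.4"},
    restrictions=["store-in-EU-only", "internal-research-use-only"],
)
```

Attaching a record like this to each result is what lets a future model connect experiments run years apart, on different teams.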
3. Use a consistent format
Next, capture data in a consistent format. Doing so can save your data team time later as they clean data for in silico investigations or other forward-looking projects. Pick and enforce standard formats, starting with the low-hanging fruit: a consistent ISO time and date format across countries, for example, can prevent expensive cleanup later.
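A small illustration of the problem ISO 8601 solves, using only the Python standard library: a slash-separated date has two plausible readings, while an ISO timestamp has exactly one.

```python
from datetime import datetime

# "03/04/2024" is March 4 in the US and April 3 in much of Europe;
# a downstream data team has no way to know which was meant.
ambiguous = "03/04/2024"
us_reading = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # 2024-03-04
eu_reading = datetime.strptime(ambiguous, "%d/%m/%Y").date()  # 2024-04-03

# An ISO 8601 timestamp parses the same way in every locale.
unambiguous = datetime.fromisoformat("2024-04-03T14:30:00+00:00")
print(us_reading, eu_reading, unambiguous.isoformat())
```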
All this data capture can add time, but the expense now can pay dividends and prevent headaches later. The easiest way to do all this is to use systems that enforce accuracy: connected instruments that automatically populate data, forms that prompt for a particular date format, and workflows that don’t let the user submit until they fill out each section. From there, a data management system that captures complex data in relational tables can ease searching and analysis across time. All these efforts help teams harmonize data across silos and make old data useful.
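As a sketch of that kind of enforcement at the point of entry (the required fields below are hypothetical), a submission check can reject incomplete records and non-ISO timestamps before they ever reach the archive:

```python
from datetime import datetime

# Hypothetical required fields -- in practice these come from your data model.
REQUIRED_FIELDS = {"experiment_id", "instrument_id", "sample_batch", "outcome", "recorded_at"}

def validate_submission(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record may be saved."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - record.keys())]
    # Enforce the agreed ISO 8601 timestamp format at the point of entry.
    timestamp = record.get("recorded_at")
    if timestamp is not None:
        try:
            datetime.fromisoformat(timestamp)
        except ValueError:
            problems.append(f"recorded_at is not ISO 8601: {timestamp!r}")
    return problems

incomplete = {"experiment_id": "EXP-0042", "instrument_id": "HPLC-07", "recorded_at": "04/03/2024"}
for problem in validate_submission(incomplete):
    print(problem)   # flags the missing fields and the non-ISO timestamp
```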
But moving towards those systems is an ongoing process. In the meantime, get stakeholders on the same page about which data needs to be captured, and reinforce a culture that respects data as a future treasure trove.
The first meaningful treasure that the volunteer archaeological sleuths found in those scrolls was the word “purple”; the word, and the scroll, were associated with an ancient philosophical tradition that prioritized enjoying the good life.
Good data management may not feel like the good life right now, but it can make life good in the future! Clean, complete, well-contextualized data can help your past yield new meaning — and keep your company off the trash heap of history.

About the author
Stuart Ward is Head of Platform Strategy at IDBS and is responsible for ensuring that IDBS’ software platform and products meet the needs of customers today and in the future. The Platform Strategy team provides the necessary business, technical and domain experience along with customer and market insights required to create software and solutions to enable BioPharma and other industries to achieve faster scientific breakthroughs. He led the creation and launch of The E-WorkBook GxP Cloud, which was IDBS’ first SaaS product for use in regulated (21 CFR Part 11, GxP) environments. Before starting this role in January 2014, he was Product Manager for E-WorkBook for four years and worked in IDBS Global Professional Services for five years, responsible for deploying IDBS’ products both from a technical and project management perspective. Prior to working at IDBS, Stuart completed a post-doctoral fellowship at the NIH and then worked for Ionix Pharmaceuticals. Stuart obtained his Ph.D. in Pharmacology from the MRC National Institute for Medical Research (University of London).