Tools are needed fast to keep the deluge of bioinformatics data from overflowing its databanks.
Those involved in bioinformatics can tell you that their stuff, their data, is coming through the door so fast that if storage solutions are not soon found, scientific progress is going to spill right out the window.
Working to contain this onslaught is Deepak Thakkar, PhD, bioscience market segment manager of SGI, Sunnyvale, Calif. “Consider the recent availability of next-generation DNA sequencers: These machines are generating up to 15 terabytes of data in a single run.”
Thakkar suggests a field-programmable gate array (FPGA)—a high-performance, programmable, integrated circuit—linked with enhanced storage capacity. “Using this hybrid solution of FPGA and shared memory blade servers, we’ve developed a bioinformatics appliance that accelerates BLAST, HMMER … all searches.” Typically, queries are returned 60 times faster than on comparable high-end cluster solutions.
Heterogeneous Computing Environment: Optimal Matching of Computing Platform to Applications. (Source: SGI)
Great. So, um … blade servers? Clusters? Gate arrays? “Back when I was in the lab environment,” recalls Thakkar, “I was challenged to implement tools that were not a part of my immediate domain.” It’s daunting and may actually discourage certain experiments, which is why SGI evaluates how their products are actually being used. “That way we build plug and play solutions that are operational from the get-go.”
Aside from storage, speed, and user-friendliness, versatility is a growing concern; R&D is now a multi-disciplined effort—disparate skills, opposing applications. “If you have a genomics appliance, it wouldn’t be optimal if you were running chemistry applications,” says Thakkar. “Yet, some chemistry programs, like AMBER, require large memory.” Rather than purchase separate systems, SGI can build a hybrid solution. “You have the shared memory system, cluster solution, and the bioinformatics appliance all built in a single architecture.”
Code-talker
Once there’s proper ebb and flow to the data stream, the goal becomes using the current to do work, and for that you need new software. Again, working with next-generation sequencers, Gabor Marth, D.Sc., assistant professor of biology, Boston College, Boston, faces the algorithmic challenge.
Problem: Technology for short-read sequencing is novel. The molecular technique, the chemistry, the optics, and the signals they produce are all quite different from those of past methods. “It’s tricky to write base-codes for these machines, turning the signal into ATCG and assigning confidence values,” says Marth. Yet it’s critical to know the statistical likelihood that base assignments are correct. Marth found the base-caller provided with his 454 Life Sciences sequencer to be sub-optimal. “It was way off from the true sequence, so we [Marth and his graduate students] wrote this Bayesian algorithm,” he says. The program, PYROBAYES, improved accuracy. This spurred a second innovation, MOSAIK, an alignment program that takes the results of resequencing and maps them back to the reference genome. A third newly hatched program is POLYBAYES++, a polymorphism discovery tool, and a fourth, the latest addition to the clutch, is EAGLEVIEW. “This allows you to actually look at the assembly of these mapped reads, the sequences, the gene annotations, all within the context of the alignment; then you can decide for yourself if the program made the right call,” Marth explains.
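To make "confidence values" concrete, here is a minimal sketch of how a per-base confidence score relates to the probability that a base call is wrong. It assumes the widely used Phred convention, Q = -10 log10(P); the article does not say which scale PYROBAYES reports, and the function names are hypothetical, chosen only for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-style quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-style quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(quals) -> float:
    """Expected number of miscalled bases in a read, given its per-base scores."""
    return sum(phred_to_error_prob(q) for q in quals)


if __name__ == "__main__":
    # Hypothetical per-base scores for a six-base read (assumed values, not real data).
    read_quals = [30, 30, 20, 20, 10, 10]
    for q in read_quals:
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
    print(f"expected miscalls in read: {expected_errors(read_quals):.3f}")
```

On this scale, a score of 20 means a 1-in-100 chance that the base is wrong, and a score of 30 means 1 in 1,000. That is why a base-caller that assigns scores accurately matters as much as one that calls the right letter.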
All the software is currently available on request. “If people ask, we give them the download credentials. Eventually there will be a commercialization, still free for academics, but a licensing fee for industry,” says Marth. Right now, he is more concerned about just getting the code out there.
All together now
Getting the code out is not only good for business, it’s good for your health, and both are now global concerns. “Because of the demands of science,” says Peter Arzberger, PhD, director of the National Biomedical Computation Resource at the University of California, San Diego, “it’s much more complicated these days to do things at the cutting edge, and because of the social demands placed upon us by issues like infectious disease, or global warming, the notion of a single investigator tackling all of this …” It’s too much. That’s why Arzberger has been working for the last five years on the challenge of global distribution of computational and data resources, with particular focus on collaborations with researchers in the Pacific Rim.
The program is called the Pacific Rim Applications and Grid Middleware Assembly (PRAGMA). “What we were able to do in PRAGMA is to finally connect the dots,” says Arzberger, referring to virtual screening methods that have been developed to encompass cross-border databases. At the moment, motivation comes from the H5N1 avian flu virus. “It’s the driver to see if distributed computing will actually work. There are a lot of people working on avian flu, and it’s going to push us to understand what it really takes to distribute jobs.”
Toy story
Serious fun. That’s what Elvis Jakupovic has been having at the Molecular and Behavioral Neuroscience Institute, University of Michigan, Ann Arbor. As part of a team working in the lab of Dr. Fan Meng, Jakupovic, a systems programmer, turns toys into supercomputers. Jakupovic and others came upon the idea after tinkering with GPUs (graphics processing units) from NVIDIA. They made the connection that they could do the same thing with a Sony PlayStation 3 (PS3), which had just come to market. “They’re both very good for vector processing,” says Jakupovic, “so the programming challenges are very similar.”
Using a software development kit from IBM, the team tricked out the PS3 with Linux. The PS3 can be connected to a network; the port is already there. Or you can hook up a keyboard, mouse, and monitor, and then program it.
PS3 is already being used to fold proteins. Folding@home, the distributed computing project at Stanford University, Palo Alto, Calif., has PS3 folding programs available for download.
About the Author
Neil Canavan is a freelance journalist of science and medicine based in New York.
This article was published in Drug Discovery & Development magazine: Vol. 10, No. 12, December, 2007, pp. 24-26.