Left: Roian Egnor. Right: Sean Eddy.
Humming nonstop in Janelia's compact computing center are 4,000 processors, 500 servers, and storage machines holding half a petabyte of data—about 50 Libraries of Congress worth of information.
Though there are many larger clusters around the world, this particular one is just right for Janelia Farm. “Beautifully conceived, ruthlessly efficient, and extraordinarily well run by the high-performance computing team,” according to Janelia researcher Sean Eddy, the system is designed to make digital images available lightning fast while muscling through the monster calculations required to help investigators conduct genome searches and catalog the inner workings and structures of the brain.
A group leader at Janelia Farm, Eddy deals in the realm of millions of computations daily as he compares sequences of DNA. He is a rare breed, both biologist and code jockey. “I'm asking biological questions, and designing technologies for other people to ask biological questions,” he says.
Eddy writes algorithms to help researchers extract information from DNA sequences. It's a gargantuan matching game where a biological sequence—DNA, RNA, or protein—is treated as a string of letters and compared with other sequences. “From a computer science standpoint, it's similar to voice recognition and data mining,” he says. “You're comparing one piece against another. We look for a signal in what looks like random noise.”
Eddy looks for the hand of evolution in DNA by comparing different organisms' genomes. He's searching for strings of DNA sequences that match—more than random chance would dictate.
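The matching game Eddy describes can be sketched in a few lines. This toy example (not Eddy's actual software, and with made-up sequences) scores how often two DNA strings agree position by position, then compares that against the roughly 25 percent agreement random chance would produce:

```python
# Toy illustration of sequence comparison as a matching game:
# count positions where two equal-length DNA strings agree,
# and contrast that with what random chance would give.
import random

def identity(a: str, b: str) -> float:
    """Fraction of positions where two equal-length sequences match."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

seq1 = "ACGTACGTACGTACGT"
seq2 = "ACGAACGTACCTACGT"
print(identity(seq1, seq2))  # 14 of 16 positions agree: 0.875

# Unrelated random DNA matches only ~25% of positions by chance,
# so agreement far above that level is the "signal" in the noise.
random.seed(0)
rand = "".join(random.choice("ACGT") for _ in range(len(seq1)))
print(identity(seq1, rand))
```

Real comparisons are far subtler—they allow gaps, weight different substitutions differently, and attach statistics to each score—but the core idea is the same: agreement well beyond chance suggests common ancestry.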
“It's a lot like recognizing words from different languages that have a common ancestry, thus probably the same meaning,” he explains. “In two closely related languages—Italian and Spanish, for example—it's pretty obvious to anyone which words are basically the same. That would be like two genes from humans and apes.”
But in organisms that are more divergent, Eddy needs to understand how DNA sequences tend to change over time. “And it becomes a difficult specialty, with serious statistical analysis,” he says.
From a computational standpoint, that means churning through a lot of operations. Comparing two typical-sized protein sequences, to take a simple example, would require a whopping 10^200 operations. Classic algorithms, available since the 1960s, can trim that search to 160,000 computations—a task that would take only a millisecond or so on any modern processor. But in the genome business, people routinely do enormous numbers of these sequence comparisons—trillions and trillions of them. These “routine” calculations could take years if they had to be done on a single computer.
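Those classic 1960s-era algorithms work by dynamic programming: rather than enumerating every possible alignment, they fill a table of best scores, one cell per pair of positions. For two 400-residue proteins that is a 400 × 400 table—the 160,000 computations mentioned above. The following is a minimal sketch in the spirit of Needleman-Wunsch global alignment (the scoring values here are arbitrary, chosen for illustration):

```python
# Minimal dynamic-programming alignment score, Needleman-Wunsch style.
# score[i][j] holds the best score for aligning a[:i] with b[:j];
# filling the (n+1) x (m+1) table costs n*m cell updates, so two
# 400-residue sequences need only ~160,000 operations.

def align_score(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a aligned against all gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b aligned against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                    # match/mismatch
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

print(align_score("ACGT", "ACGT"))  # 4: perfect match
print(align_score("ACGT", "AGT"))   # 2: three matches, one gap
```

Each cell depends only on its three neighbors, which is what makes the table cheap to fill—and, as the next paragraph notes, independent comparisons like these are easy to farm out across many processors.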
That's where the Janelia cluster comes in. Because a different part of the workload can easily be doled out to each of its 4,000 processors, researchers can get their answers 4,000 times faster—in hours instead of years. The solutions don't tend to lead to eureka moments; rather, they provide reference data for genome researchers as they delve into the complexities of different organisms. “These computational tools are infrastructural, a foundation for many things,” Eddy says.
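Because every sequence comparison is independent of the others, the work is what programmers call "embarrassingly parallel." The sketch below uses Python's multiprocessing pool as a stand-in for the cluster's job scheduler; the comparison function and sequences are placeholders, not Janelia's actual pipeline:

```python
# Sketch of doling out independent comparisons to worker processes.
# A Pool stands in for a cluster scheduler: each (query, target)
# pair is a separate job, and results come back in job order.
from multiprocessing import Pool

def compare(pair):
    a, b = pair
    # stand-in for a real sequence-comparison routine
    return sum(1 for x, y in zip(a, b) if x == y)

jobs = [("ACGT", "ACGA"), ("ACGT", "TTTT"), ("ACGT", "ACGT")]

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(compare, jobs)
    print(results)  # [3, 1, 4]
```

With trillions of comparisons split across 4,000 processors instead of one, the same arithmetic that would take years serially finishes in hours.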
Photos: Paul Fetters