Constance Cepko: The challenge of storing the vast amount of data generated by her research has her searching for commercial solutions.
“Computer hardware and software quickly become obsolete, so that unless you hold on to your old computers the data you backed up with them may become difficult if not impossible to recover,” says Terrence J. Sejnowski, a computational neuroscientist and HHMI investigator at the Salk Institute for Biological Studies in La Jolla. “It's something we have to live with.”
Retaining all that material is easier said than done, however.
“It's a problem for everybody,” says HHMI investigator Constance L. Cepko, a neurobiologist at Harvard Medical School who studies the structure and function of the eye in vertebrates. “In trying to link DNA clones, in-situ images, and microarray data, we can generate 30,000 data points in one experiment.” She and her colleagues considered commercial data-management packages and high-tech start-up services for archiving such data, but none met their needs. At present, an M.D.-Ph.D. student is setting up a customized relational database, but it is just a temporary solution.
Cepko says that because the volume of data her lab generates is rapidly filling servers, she is looking to a centralized archiving system, such as the Mouse Gene Expression Database at the Jackson Laboratory (TJL) in Bar Harbor, Maine, to take some of the data off her hands. TJL aims to make the database, funded by the National Institutes of Health, the leading archive of mouse genomic and proteomic data, and is actively soliciting and adding primary data to its curated, annotated database.
In much the same spirit, Sejnowski has an agreement with the San Diego Supercomputer Center, which maintains and archives all of his lab's large data sets. “You have to find a partner,” he insists. “Data have become so unwieldy that managing them is too much for any one lab to handle on its own.”
HHMI investigator Norbert Perrimon, who studies cell signaling at Harvard Medical School, found the solution to his data-management problems—at least, for the time being—by setting up a centralized public database to store the results of his lab's RNA interference screens in Drosophila. Its infrastructure was funded by a grant from the National Institutes of Health, which allowed him to hire two full-time programmers to get the job done.
But in the long run, the solution will depend on cheaper ways of storing data and on being more selective about what is kept, says Perrimon. “The issue that we are facing now is that we do not yet know what is worth keeping in these large-scale studies because the [RNAi] field is not very mature yet. We need to spend more time on data analysis to figure out what has real value in the data sets.” So, for the time being, he is storing it all.
Paul W. Sternberg, an HHMI investigator at the California Institute of Technology, believes the answer may lie in more intelligent searching. “My general feeling is that we know a lot more than we think we do in biology,” he says. “We aren't taking full advantage of what already exists out there. Digital storage is cheap. We should be archiving and making retrievable unpublished primary data.” He is working on systems that will let scientists combine primary data from disparate sources and develop new hypotheses by pooling what he calls “weak hints,” which tend to be overlooked when sources are assessed individually.
In the March 10, 2006, issue of Science, Sternberg and colleagues described how to apply such a computational approach to integrating published data on how genes interact with each other in roundworms, fruit flies, and yeast. “We now know that mining published and available data is valuable,” Sternberg says. “Imagine what we could do if we could access the likely larger amount of unpublished information.”
Sternberg believes this idea also extends to updating that laboratory mainstay, the lab notebook. “The new generation is more comfortable with electronic notebooks,” he says. One of his graduate students keeps a personal blog on the lab's private intranet for recording observations and ideas. “I would have kept that kind of thing in a margin of my [paper] notebook,” says Sternberg. “But then how would I ever find it again? In digital form, you can search and organize thoughts and ideas—and have instant recall.”
Photo: Jason Grow