Computational Methods for Identifying the Molecular "Parts Lists" of Cells

Summary: Philip Green develops mathematical, statistical, and computer methods for analyzing the genomes of humans and other organisms. He has written a number of software packages that are widely used in the Human Genome Project. Green's programs have been used to process and assemble DNA sequencing data, make the genetic maps that are used to localize the genes for genetic diseases, and identify genes and other biological features in the genome sequence.
Our goal is to help provide the computational methods necessary to achieve a complete, quantitative understanding of how cells function at the molecular level. Such an understanding will require three things: a "parts list," or catalog of all cellular molecules; a "wiring diagram" that specifies the interactions that occur between those molecules; and, finally, quantitative models of systems of interacting molecules. Although substantial work remains, the advent of large-scale genome sequencing is bringing the possibility of completing the parts list within view. Most current research in molecular biology is directed at filling in the wiring diagram (i.e., specifying molecular "function"). The modeling of molecular systems, still in its infancy, will become increasingly important as the wiring diagram approaches completion and our ability to quantitate cellular molecules accurately improves.
Most of our research has been directed at constructing computational tools to support the acquisition of the parts list, in the form of a gene-annotated genome sequence. Initially this involved developing methods for constructing the genetic and physical genome maps that are preparatory to genome sequencing. We developed (in collaboration with Eric Lander, Whitehead Institute/MIT Center for Genome Research) improved methods for making maps of multiple genetic loci and implemented these in a computer program, crimap; this made it possible for us (with scientists at Collaborative Research and the Whitehead Institute) to construct the first genetic linkage map spanning the entire human genome, giving the chromosomal locations of genetic markers that can be used in localizing the genes for genetic diseases. We subsequently developed strategies and data analysis methods for constructing physical maps to localize cloned DNA segments along chromosomes; these methods were implemented in segmap, which was used to make clone maps of a number of human chromosomes.
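As a rough illustration of the statistics behind genetic linkage mapping, the sketch below computes a two-point LOD score, the log10 likelihood ratio comparing linkage at recombination fraction theta against no linkage (theta = 0.5). It assumes phase-known, fully informative meioses, which is the simplest textbook case; it is not the multilocus method implemented in crimap, and the function name and example counts are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses (illustrative assumption).

    Compares the binomial likelihood of observing `recombinants` out of
    `meioses` at recombination fraction `theta` against the likelihood
    under free recombination (theta = 0.5).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


# Hypothetical data: 2 recombinants observed in 20 informative meioses.
score = lod_score(recombinants=2, meioses=20, theta=0.1)
print(f"LOD at theta=0.1: {score:.2f}")
```

By convention, a LOD score of 3 or more is usually taken as significant evidence that two loci are linked.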
Our more recent work has focused on developing tools for assembling and interpreting the genome sequence. For processing of genome sequence data, we developed phred, a program that performs base calling and quality assessment of the raw data from automated DNA sequencing machines; phrap, which assembles the sequence reads produced by phred to infer the underlying sequence; and consed/autofinish, which allows manual review and editing of the sequence data and automates the process of choosing additional reads to bring the sequence to a high final accuracy. These programs are widely used in laboratories doing large-scale DNA sequencing. For interpretation of the sequence, we developed genefinder, which uses probabilistic models to delineate genes within genomic sequences and has been the primary tool for gene identification in the nematode Caenorhabditis elegans genome project, among others; and with Arian Smit (Institute for Systems Biology) we developed repeatmasker, which is used for detecting interspersed repetitive elements in mammalian DNA.
Common themes in this work include the development of appropriate probabilistic models for the type of data to be analyzed, the construction of efficient algorithms to carry out the probabilistic calculations, and the implementation of the algorithms in software that is then made widely available. Probabilistic methods have been particularly crucial, a reflection of the inherently probabilistic nature of biological processes such as meiotic recombination and sequence evolution, as well as of laboratory data. One recent example is our development of error probabilities associated with the base calls for sequencing reads. This has turned out to have a number of applications, including more effective quality control at the raw data collection level, more accurate assembly of reads, a useful criterion to guide sequence finishing, an objective measure of the accuracy of the final sequence, and an effective tool for discriminating true sequence differences among individuals from sequence errors.
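The error probabilities attached to base calls are commonly reported on the Phred quality scale, Q = -10 log10(P), where P is the estimated probability that the call is wrong. The sketch below shows that conversion and one simple use of it, summing the per-base probabilities to estimate the expected number of errors in a read. The function names and the example quality values are illustrative assumptions, not part of phred itself.

```python
import math
from typing import Iterable


def error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality value Q to an error probability."""
    return 10 ** (-quality / 10)


def quality_score(p_error: float) -> float:
    """Convert an error probability to a Phred-scaled quality value."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)


def expected_errors(qualities: Iterable[float]) -> float:
    """Expected number of miscalled bases: the sum of per-base error probabilities."""
    return sum(error_probability(q) for q in qualities)


# Hypothetical per-base quality values for a short read.
read_qualities = [10, 20, 30, 40]
print(f"Expected errors in read: {expected_errors(read_qualities):.4f}")
```

On this scale Q20 corresponds to a 1-in-100 chance of error and Q30 to 1 in 1,000, which gives a direct, objective way to state the accuracy of a read or a finished sequence.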
Reliable identification of the protein parts list from a genome sequence remains an unsolved problem. We are attacking this on several fronts, including improved probabilistic modeling of the genomic sequence, comparison to evolutionarily related sequences, and more effective utilization of available data. Some of our work draws on our experience with sequence data processing to assemble expressed sequence tags (ESTs), partial gene sequences that have been generated in a number of laboratories and submitted in unassembled form to the public databases. Prior to the availability of the full genome sequence, we used our EST assemblies to conclude that the number of genes in the human genome is substantially lower (about 35,000) than had been previously thought. We are applying the assemblies to make more reliable inferences of protein-coding sequences, to catalog alternative splicing (which can result in multiple proteins being encoded by the same gene), and to discover polymorphisms (differences in sequence among individuals). We have begun a laboratory effort to test our gene predictions systematically.
The availability of sequence data from evolutionarily related organisms provides a powerful tool for identifying genes and illuminating their function. Through comparisons of yeast, human, and nematode sequences, we observed a number of years ago that a substantial fraction of genes (approaching 50 percent) appeared unique to an organism and its close relatives, an observation that has been repeatedly borne out with each new genome sequence that has been obtained. Most likely many of the "unique" genes do in fact have evolutionary homologs in more distant organisms but are simply evolving too quickly for the relationship to be detected readily, and we have developed methods for more sensitive detection of evolutionarily conserved sequence features. Evolutionary data should help us understand the wiring diagram of molecular interactions, since it is primarily these interactions (including the self-interactions that determine tertiary structure) that constrain the allowed residue substitutions. We are working on improved probabilistic models of sequence evolution in the hope that these will allow such functional inferences.
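For readers unfamiliar with probabilistic models of sequence evolution, the sketch below uses the Jukes-Cantor model, the simplest textbook example, to correct the observed fraction of differing sites between two aligned DNA sequences for multiple substitutions at the same site. It is offered only as a baseline illustration and is not one of the improved models described above; the function names and example sequences are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned A/C/G/T positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip gaps and ambiguity codes so only unambiguous bases are compared.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences differing at one of eight sites.
p = p_distance("ACGTACGT", "ACGTACGA")
print(f"observed: {p:.3f}, corrected: {jukes_cantor_distance(p):.3f}")
```

The corrected distance grows faster than the observed one as sequences diverge, which is one reason rapidly evolving homologs can be hard to detect by simple comparison.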
Some of our research has been supported by the National Human Genome Research Institute and the Department of Energy.
Last updated: February 15, 2007