Algorithms for Genome Analysis

Summary: David Haussler is developing new statistical and algorithmic methods to explore the molecular evolution of the human genome, integrating cross-species comparative and high-throughput genomics data to study gene structure, function, and regulation.
My genome informatics team has participated in the public consortium efforts to produce, assemble, and annotate the first mammalian genomes. As collaborators in the Human Genome Project, we built the program that assembled the first working draft of the human genome sequence from information produced by sequencing centers worldwide, and we participated in the informatics associated with the finishing effort. We provide an interactive genome browser for the human, mouse, rat, and other genomes that is used by thousands of biomedical researchers every day (genome.ucsc.edu). By integrating multiple sets of high-throughput genomics data, computational predictions, and curated genomic feature sets from dozens of laboratories, the browser provides a new kind of computational microscope for exploring genomes.
Our work developing and annotating genomes for the browser provides both a foundation and a Web-based forum for our scientific efforts. These are directed at the large-scale discovery and characterization of the functional elements in mammalian genomes through comparative sequence analysis, the study of mammalian molecular evolution, and the integration of an increasing variety of high-throughput data sets provided by functional genomics efforts.
As our computational laboratory has generated increasingly interesting findings, we have established a new wet lab to test the hypotheses they suggest.
Throughout the approximately 75 million years since the human species diverged from its common ancestor with the rat and mouse, the three genomes have independently accumulated many changes, leading to the three different species we see today. Reconstructing these changes by computational analysis has given us a new understanding of mammalian genome evolution. In comparisons of the human, mouse, and rat genomes, we have found that the rate of neutral substitution varies regionally along the chromosomes. The mechanistic explanation of this variation has not yet been found. We determined that a core of about 40 percent of the human, rat, and mouse genome sequences derives from a common ancestor, and we produced base-level alignments between the three genomes in these regions. This alignment, combined with characterization of neutral substitution rates, led to the estimate that approximately 5 percent of the human genome is under purifying selection.
We suspect that these conserved regions contain the most functionally important elements of the genome and point to areas where intensified study will lead to a better understanding of how the genome works. Since only 1.5 percent of the genome is coding, if this rough estimate holds up, it would imply that there is roughly an additional 3.5 percent of the genome that is functionally important noncoding DNA. Our goal over the next several years is to characterize these regions computationally and in many cases also functionally, through wet-lab experiments.
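The arithmetic behind this rough estimate can be sketched in a few lines. The sketch assumes that all coding sequence falls within the portion under purifying selection, and it uses a genome size of 2.9 billion bases to convert fractions into base counts; the function and constant names are illustrative only.

```python
# Illustrative arithmetic only; names are hypothetical.
GENOME_BASES = 2_900_000_000  # approximate size of the human genome in bases


def noncoding_selected_fraction(selected_fraction, coding_fraction):
    """Fraction of the genome that is under selection but not coding.

    Assumes every coding base is counted within the selected fraction.
    """
    if coding_fraction > selected_fraction:
        raise ValueError("coding fraction exceeds selected fraction")
    return selected_fraction - coding_fraction


frac = noncoding_selected_fraction(0.05, 0.015)
bases = frac * GENOME_BASES
print(f"{frac:.1%} of the genome, about {bases / 1e6:.1f} million bases")
```

Under these assumptions, the noncoding share under selection comes to roughly 100 million bases, which gives a sense of the scale of the characterization effort.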
Further study of the portion of the human genome that is under purifying selection has led to the identification of 481 "ultraconserved" regions of 200 or more DNA bases that are completely identical in the human, mouse, and rat genomes. The probability of finding even one such element in the 2.9 billion bases of the human genome is almost zero under a standard model of neutral evolution, where every base is equally likely to undergo independent change. Nearly all of the unchanged regions were also found to be almost unchanged in the dog and chicken genomes, and two-thirds of them were found in the fish genome. But the noncoding ones cannot be traced beyond the fish to sea squirt, fly, or worm. These 481 ultraconserved regions most often either overlap genes that are involved in RNA processing or reside in the noncoding portions of genes or near genes that are involved in regulating gene transcription or development. We are exploring the functions of these elements. If indeed the conservation in these elements is due to purifying selection, then it is a mystery what molecular mechanisms would induce continuous conservation over hundreds of bases.
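To see why such elements are so improbable under neutral evolution, consider a deliberately simplified model in which each aligned base is identical in all three species independently with some fixed probability. The sketch below computes the expected number of 200-base windows that are fully identical, which is also an upper bound on the probability of seeing at least one. The per-base probability of 0.7 is a hypothetical value chosen for illustration, not a measured rate, and the function name is ours.

```python
import math


def expected_identical_runs(genome_length, run_length, p_identical):
    """Expected number of windows of `run_length` consecutive bases that are
    identical across all species, assuming independent sites.

    `p_identical` is the assumed per-base probability that a site is
    unchanged in every species compared (a hypothetical parameter).
    """
    if not 0.0 < p_identical <= 1.0:
        raise ValueError("p_identical must be in (0, 1]")
    windows = genome_length - run_length + 1
    if windows <= 0:
        return 0.0
    # Work in log space so that p_identical ** run_length does not underflow
    # before it is multiplied by the number of windows.
    log_expected = math.log(windows) + run_length * math.log(p_identical)
    return math.exp(log_expected)


# Hypothetical per-base identity of 0.7 across human, mouse, and rat.
expected = expected_identical_runs(2_900_000_000, 200, 0.7)
print(f"expected number of 200-base identical windows: {expected:.2e}")
```

Because the probability of at least one such window cannot exceed the expected count, any plausible per-base value well below 1 yields a vanishingly small chance, in line with the "almost zero" statement above.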
In an attempt to build realistic and information-rich mathematical models of molecular evolution, we have undertaken larger, multispecies comparisons. Some of these models are tailored to specific kinds of functional elements, such as coding exons (in conjunction with the NIH Mammalian Gene Collection project) and transcription factor-binding sites (in conjunction with the National Human Genome Research Institute ENCODE project). These models should identify elements under purifying selection with higher sensitivity and specificity than was possible with two-species comparisons. Ultimately we hope to explore the full spectrum of events in mammalian molecular evolution, including insertions, deletions, duplications, inversions, and rearrangements. As the number of genomes grows, our goal is to produce increasingly accurate analyses of the evolutionary history of each base in the human genome as a basis for genome-wide functional analysis.
Recent work has revealed some unexpected origins for some ultraconserved elements. Multiple close copies of one of these critical DNA sequences in our genome can be traced to our common ancestor with the coelacanth, a descendant of the ancient marine organism that gave rise to the terrestrial vertebrates more than 360 million years ago. These sequences appear to derive from DNA elements known as retroposons. In the coelacanth, the segments were produced by a retroposon known as a short interspersed repetitive element, or SINE, which is a piece of DNA that can make copies of itself and insert those copies elsewhere in an organism's genome. Wet-lab tests in mouse embryos have confirmed that one of these segments regulates a nearby neurodevelopmental gene. Thus, the movement of retroposons can generate evolutionary experiments by adding new regulatory modules to genes, and for as yet unknown reasons, these can occasionally become ultraconserved.
We have also begun to explore sudden change in noncoding regions of the genome that have previously been highly conserved by purifying selection. Comparing our genome to that of our closest relative, the chimpanzee, we found the most dramatic example of evolutionary acceleration in a novel RNA gene that is expressed specifically in neurons in the developing human neocortex during a critical period for cortical neuron specification and migration. This and other regions of accelerated change in the human genome provide exciting new candidates in the search for uniquely human biology.
This work is funded in part by grants from the National Human Genome Research Institute, the National Cancer Institute, the National Science Foundation, and the California Institute for Quantitative Biomedical Research (QB3).
Last updated: September 11, 2006