Algorithms for Genome Analysis

Research Summary

David Haussler is developing new statistical and algorithmic methods to explore the molecular evolution of the human and other vertebrate genomes, integrating cross-species comparative and high-throughput genomics data to study gene structure, function, and regulation. He applies genome-scale evolutionary analysis to the study of cancer and other diseases.

My genome informatics team has participated in the public consortium efforts to produce, assemble, and annotate the first vertebrate genomes. As collaborators in the Human Genome Project, we built the program that assembled the first working draft of the human genome sequence from information produced by sequencing centers worldwide, and we participated in the informatics associated with the finishing effort. We provide an interactive genome browser for the human, mouse, rat, and other genomes that is used by thousands of biomedical researchers every day (genome.ucsc.edu). By integrating multiple sets of high-throughput genomics data, computational predictions, and curated genomic feature sets from dozens of laboratories, the browser provides a new kind of computational microscope for exploring genomes.

Our work developing and annotating genomes for the browser provides a foundation for our scientific efforts. These are directed at the large-scale discovery and characterization of the functional elements in vertebrate genomes through comparative sequence analysis, the study of vertebrate molecular evolution, and the integration of an increasing variety of high-throughput data sets provided by functional genomics efforts to address human disease and basic questions in biology.

Throughout the approximately 75 million years since the human lineage diverged from its common ancestor with the rat and mouse, the three genomes have independently accumulated many changes, leading to the three different species we see today. We participated in the first effort to reconstruct these changes by computational analysis. In comparisons of the human, mouse, and rat genomes, we found that the rate of neutral substitution varies regionally along the chromosomes. The mechanistic explanation of this variation has not yet been found. We determined that a core of about 40 percent of the human, rat, and mouse genome sequences derives from a common ancestor, and we produced base-level alignments between the three genomes in these regions. This alignment, combined with characterization of neutral substitution rates, led to the estimate that at least 5 percent of the human genome exhibits an evolutionary pattern of negative selection; changes to the bases in these regions usually reduce fitness, and hence seldom become established in the population.
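As a rough illustration of what regional variation in substitution rate means, the sketch below counts mismatches between two aligned sequences in consecutive windows. It is a toy under stated assumptions, not the method used in the studies described here: the function name `windowed_divergence`, the example sequences, and the window size are all hypothetical, and a raw mismatch fraction is only a crude stand-in for a neutral substitution rate, which would be estimated from putatively neutral sites with an explicit substitution model.

```python
def windowed_divergence(seq_a, seq_b, window):
    """Mismatch fraction per non-overlapping window of a pairwise alignment.

    Columns containing a gap ('-') in either sequence are skipped.
    A window with no ungapped columns yields None.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    rates = []
    for start in range(0, len(seq_a), window):
        # Keep only columns where both sequences have a base.
        pairs = [
            (a, b)
            for a, b in zip(seq_a[start:start + window], seq_b[start:start + window])
            if a != "-" and b != "-"
        ]
        if not pairs:
            rates.append(None)
            continue
        mismatches = sum(a != b for a, b in pairs)
        rates.append(mismatches / len(pairs))
    return rates


# Hypothetical toy alignment: the middle window has diverged, the flanks have not.
seq_a = "ACGTACGTACGT"
seq_b = "ACGTTTGAACGT"
rates = windowed_divergence(seq_a, seq_b, window=4)
print(rates)  # [0.0, 0.75, 0.0]
```

On real data the windows would span many kilobases, and the interesting signal is how the per-window value changes along a chromosome.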

We suspect that these conserved regions contain the most functionally important elements of the genome and point to areas where intensified study will lead to a better understanding of how the genome works. Since only 1.5 percent of the genome is coding, if this rough estimate continues to hold up, it would imply that at least an additional 3.5 percent of the genome (the noncoding portion of the 5 percent of DNA under negative selection) has been functionally important for at least 75 million years. Some of these noncoding regions are "ultraconserved," showing almost no change for hundreds of millions of years. We have confirmed that negative selection is three times stronger in these regions than it is for nonsynonymous changes in coding regions. It is a mystery what molecular mechanisms would place virtually every base in segments up to 1 kilobase long under this level of negative selection. Comparing our genome to that of our closest relative, the chimpanzee, we found dramatic examples of regions that exhibit strong negative selection over more than 100 million years of evolution but have accumulated many changes in our lineage just in the last 5 million years since we diverged from our common ancestor with the chimpanzees. One is a novel RNA gene that is expressed specifically in neurons in the developing human neocortex during a critical period for cortical neuron specification and migration. Some of the human-specific changes found in this study may be among the many thousands of genomic events that helped define us as a species.
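The 3.5 percent figure in the paragraph above follows from simple subtraction, sketched here for concreteness. The percentages come from the text; the genome length of 3 billion bases is a round number assumed only for illustration, and the variable names are hypothetical.

```python
from fractions import Fraction

# Figures quoted in the text, in percent of the human genome.
selected_pct = Fraction(5)      # at least this much is under negative selection
coding_pct = Fraction(3, 2)     # 1.5 percent is protein coding

# Even if every coding base is counted among the selected bases,
# the remainder must be noncoding, so this is a lower bound.
noncoding_pct = selected_pct - coding_pct

# ASSUMPTION: a round genome length of 3 billion bases, for scale only.
assumed_genome_bases = 3_000_000_000
noncoding_bases = noncoding_pct / 100 * assumed_genome_bases

print(float(noncoding_pct))   # 3.5
print(int(noncoding_bases))   # 105000000
```

Under that assumed length, the lower bound corresponds to roughly a hundred million noncoding bases under negative selection.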

To unveil the evolutionary history of the vertebrate species, Stephen O’Brien (Laboratory of Genomic Diversity, National Cancer Institute), Oliver Ryder (Institute for Conservation Research, San Diego Zoo), and I, along with a large group of scientists, recently proposed a large-scale Genome 10K Project to sequence the genomes of at least 10,000 species. If funded, this project would provide an unprecedented foundation for the study of vertebrate evolution and its relationship to vertebrate biology, human and animal health, and conservation. In support of this project, my group is developing new mathematical and computational models of genome evolution, including insertions, deletions, duplications, inversions, rearrangements, and substitutions. As the number of sequenced genomes grows, our goal is to produce increasingly accurate analyses of the evolutionary history of each base in the human and other vertebrate genomes as a basis for genome-wide functional analysis.

Our comparative genomics work has revealed unexpected origins for some ultraconserved elements. Multiple close copies of one of these critical DNA sequences in our genome can be traced to our common ancestor with the coelacanth, a descendant of the ancient marine organism that gave rise to the terrestrial vertebrates more than 360 million years ago. These sequences appear to derive from DNA elements known as retroposons, which are evolutionarily derived from retroviruses. In the coelacanth, the segments were produced by a retroposon known as a short interspersed repetitive element, or SINE, which is a piece of DNA that can make copies of itself and insert those copies elsewhere in an organism's genome. Wet-lab tests have confirmed that one of these segments regulates a nearby neurodevelopmental gene. Thus, the movement of retroposons can generate evolutionary experiments by adding new regulatory modules to genes and, for as yet unknown reasons, these can occasionally become highly conserved. We recently found similar examples from examination of other rare and endangered species.

Genomic changes also occur during the development of cancer and may be studied using the same mathematical and computational methods that we use to compare genomes between closely related species. We participate in two large consortia using genome-scale analysis to better understand cancer: the Cancer Genome Atlas and the Stand Up To Cancer projects. Our computational analysis of tumor genomes reveals pathways of genetic interactions that are repeatedly found to be altered by genetic changes in particular subtypes of cancer and that may provide cancer experts new avenues for treating these cancer subtypes.

This work is funded in part by grants from the National Human Genome Research Institute, the National Cancer Institute, the American Association for Cancer Research, the National Institute on Drug Abuse, and the California Institute for Quantitative Biomedical Research (QB3).

As of December 29, 2009

Scientist Profile

University of California, Santa Cruz
Computational Biology, Molecular Biology