"All history was a palimpsest, scraped clean and reinscribed exactly as often as was necessary." (George Orwell, Nineteen Eighty-Four)
A palimpsest is a text that has been incompletely erased and overwritten, possibly multiple times. (Parchment used to be precious.) Careful analysis of a palimpsest reveals the shadows of old texts, illuminating the history of the document. A DNA sequence is a genetic palimpsest that has been overwritten many times by evolution. I use computational sequence analysis to infer the structures, functions, and histories of genes from their sequences in modern genomes. Genome sequences preserve a record that can reveal the deepest history of evolution: perhaps the origins of life on the planet, but at least the origin of the last common ancestor of modern life forms. I am especially interested in testing the theory that organisms based on a biochemistry of catalytic RNAs preceded modern protein- and DNA-based life, and in whether any remnants of this "RNA world" persist in modern genomes.
The Human Genome Project and other large DNA-sequencing efforts aim to deliver complete DNA sequences from humans as well as from model genetic organisms, pathogenic microbes, crop plants, and livestock. By itself, however, a raw genome sequence is almost useless. Interpretation of genome sequences depends on computational analysis to discover genes and to infer the biochemical functions of those genes by recognizing similarities to known sequences or structures. To recognize the subtle shadows of ancient ancestry in these multi-billion-year-old genetic palimpsests, we make use of sophisticated probabilistic inference methods. My work aims to make fundamental technological contributions to the way that DNA sequences are analyzed and to deliver these advances in robust, well-engineered software tools.
A New Generation of Homology Search Tools
The main tool that geneticists use to recognize evolutionarily related sequences is a program called BLAST, first introduced in 1990. BLAST is a fundamental tool in the field: molecular biology's Google. But the algorithms and mathematics that underlie sequence homology recognition went through a major revolution in the 1990s with the advent of probabilistic inference methods, particularly a class of methods called hidden Markov models (HMMs), which provide a mathematical framework that formalizes the problem of distant sequence homology recognition. Over the past decade, many HMM-based approaches have been developed for sequence analysis, including a software package called HMMER from my lab. Despite these advances in mathematical underpinnings, the BLAST software package remains the workhorse of the field, largely because BLAST is about 100-fold faster than the fastest implementations of the newer and supposedly better HMM-based methods. At Janelia, we have launched a major effort to engineer software that delivers the power of HMM-based methods while running at or above BLAST speed. Our aim is to bring about a generational change in the most important tool of molecular sequence analysis.
Identifying Noncoding RNAs in Genome Sequences
According to the RNA world hypothesis, RNA catalysts and replicators preceded modern protein/DNA machines. This hypothesis arose from the discovery of catalytic RNAs and also from the fact that functional RNAs are used instead of protein enzymes in some ancient, highly conserved roles in modern organisms. Some proponents of the RNA world hypothesis view extant functional RNAs as "molecular fossils" of the RNA world. How many genes encode functional RNA rather than protein? What are their functions? How many are evolutionarily ancient? The answers to these questions might shed some light on the origin of life.
Because noncoding RNA (ncRNA) genes tend to be small and are inherently immune to frameshift or nonsense mutations, they are hard to find by classical mutational genetic screens. They are also difficult to recognize in genomes because they do not have open reading frames and thus cannot be discovered by gene-finding programs. Their sequences also tend to evolve rapidly, conserving their structure more than their primary sequence, making them difficult to discover by standard database searches.
The availability of genome sequence data suggested that it might be possible to identify novel ncRNA genes systematically by computational screens, but only if new similarity search and gene identification algorithms could be developed that were capable of dealing with RNA's secondary structure features. Others and I have developed such algorithms, based on Bayesian probabilistic models called "stochastic context-free grammars."
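To give a feel for why secondary structure calls for a different kind of algorithm, here is a minimal sketch of the nested recursion that structure-aware methods rely on. It is not a stochastic context-free grammar and not the method used in my lab's software: it is the much simpler Nussinov-style base-pair maximization, shown only because it shares the same shape of dynamic programming over subsequences. The function name `max_nested_pairs` and the `min_loop` parameter are illustrative assumptions.

```python
# Toy illustration (assumption: Nussinov-style pair counting, not an SCFG).
# Counts the maximum number of nested base pairs an RNA sequence can form.

CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                   ("C", "G"), ("G", "U"), ("U", "G")}


def max_nested_pairs(seq, min_loop=3):
    """Return the largest number of nested canonical pairs in seq.

    min_loop is the minimum number of unpaired bases required between
    the two partners of a pair (a hypothetical default of 3).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1 and 2: leave position i or position j unpaired.
            score = max(best[i + 1][j], best[i][j - 1])
            # Case 3: pair i with j around the enclosed subsequence.
            if (seq[i], seq[j]) in CANONICAL_PAIRS:
                score = max(score, best[i + 1][j - 1] + 1)
            # Case 4: split into two independent substructures.
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


hairpin_pairs = max_nested_pairs("GGGAAAUCCC")  # three G-C pairs close the loop
```

A stochastic context-free grammar replaces the pair count with probabilities for paired and unpaired positions, which is what makes it usable as a Bayesian model for similarity search.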
We have used these algorithmic approaches to develop a program that identifies ncRNA genes by taking advantage of comparative genome sequence analysis—that is, by comparing the DNA sequences of related organisms such as human and mouse. The pattern of mutations we observe in a human sequence compared to the related mouse sequence tells us something about the function of the sequence. Basically, we construct three statistical models describing the pattern of mutation we expect to see in RNA genes, protein genes, and other conserved sequences, and we test each conserved genomic region for the model it seems to fit best. Our first large-scale test of this approach was done in the small genome of the bacterium Escherichia coli, where our program has predicted a few hundred new RNA genes. We have also conducted computational screens for new ncRNA genes in other organisms, including well-known organisms such as humans, nematodes, and yeast, and in less-well-known organisms such as the deep-sea-vent extremophile Pyrococcus and the pond ciliate Oxytricha, whose genomes have unusual properties that allowed us to conduct particularly simple screens for new ncRNAs.
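The three-model comparison described above can be sketched as follows. This is a deliberately crude toy, not the actual program: the real RNA model considers possible structures with a stochastic context-free grammar, whereas this sketch assumes the base pairs are handed in as a list. All function names, the labels `RNA`, `COD`, and `OTH`, and every probability value are illustrative assumptions.

```python
import math

CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def column_logp(a, b, p_same):
    """Log-probability of one aligned column: the 4 identical outcomes
    share p_same, the 12 mismatched outcomes share 1 - p_same."""
    return math.log(p_same / 4 if a == b else (1 - p_same) / 12)


def loglik_other(x, y, p_same=0.8):
    """Toy 'other conserved sequence' model: columns mutate independently."""
    return sum(column_logp(a, b, p_same) for a, b in zip(x, y))


def loglik_coding(x, y, p_same_12=0.95, p_same_3=0.55):
    """Toy 'protein gene' model: third codon positions tolerate more change.
    The reading frame is unknown, so average over the three frames."""
    frame_totals = []
    for frame in range(3):
        total = 0.0
        for i, (a, b) in enumerate(zip(x, y)):
            third = (i - frame) % 3 == 2
            total += column_logp(a, b, p_same_3 if third else p_same_12)
        frame_totals.append(total)
    top = max(frame_totals)  # log-sum-exp with a uniform prior on frames
    return top + math.log(sum(math.exp(t - top) for t in frame_totals) / 3)


def loglik_rna(x, y, pairs, p_both_paired=0.9, p_same=0.8):
    """Toy 'RNA gene' model: paired columns are rewarded when both species
    keep a canonical base pair, even if the bases themselves differ."""
    paired = {i for pair in pairs for i in pair}
    total = sum(column_logp(x[i], y[i], p_same)
                for i in range(len(x)) if i not in paired)
    for i, j in pairs:
        both = (x[i], x[j]) in CANONICAL and (y[i], y[j]) in CANONICAL
        # 6 * 6 = 36 outcomes keep the pair in both species; 220 do not.
        total += math.log(p_both_paired / 36 if both
                          else (1 - p_both_paired) / 220)
    return total


def classify(x, y, pairs):
    """Return the best-fitting model label and all three log-likelihoods."""
    scores = {"RNA": loglik_rna(x, y, pairs),
              "COD": loglik_coding(x, y),
              "OTH": loglik_other(x, y)}
    return max(scores, key=scores.get), scores


# Hypothetical hairpin: the stem differs between species but still pairs.
stem = [(0, 11), (1, 10), (2, 9), (3, 8)]
label, scores = classify("GGCAGAAAUGCC", "ACUAGAAAUAGU", stem)  # label: "RNA"
```

The point of the toy is the signal each model looks for: compensatory changes that preserve base pairing favor the RNA model, changes concentrated at every third position favor the coding model, and changes scattered without pattern favor the third model.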
For those new genes where we have some indication of their function, most appear to be functioning as highly adapted regulatory molecules, which is not consistent with the idea that ncRNAs are ancient molecular fossils of the RNA world. I have argued instead for a "modern RNA world" view, where functional RNA is still actively deployed by evolution in roles where RNA is better suited than protein, such as sequence-specific recognition of other RNAs.
As of June 16, 2010