Computational Genome Analysis of Proteins and Noncoding RNAs

Summary: Sean Eddy is interested in identifying genes that make functional RNAs instead of encoding proteins.
All history was a palimpsest, scraped clean and re-inscribed exactly as often as necessary.
—George Orwell
A palimpsest is a text that has been incompletely erased and overwritten, possibly multiple times. (Paper used to be precious.) Careful analysis of a palimpsest reveals the shadows of old texts, illuminating the history of the document. A DNA sequence is a genetic palimpsest that has been overwritten many times by evolution. I use computational sequence analysis to infer the structures, functions, and histories of genes from their sequences in modern genomes. Genome sequences preserve a record that can reveal the deepest history of evolution—perhaps the origins of life on the planet, but at least the origin of the last common ancestor of modern life forms. I am especially interested in testing the theory that organisms based on a biochemistry of catalytic RNAs preceded modern protein- and DNA-based life, and in whether any remnants of this "RNA world" persist in modern genomes.
The Human Genome Project and other large DNA-sequencing efforts aim to deliver complete DNA sequences from human as well as model genetic organisms, pathogenic microbes, crop plants, and livestock, but by itself, a raw genome sequence is almost useless. Interpretation of a genome sequence depends on computational analysis to discover genes and to infer the biochemical functions of those genes by recognizing similarities to known sequences or structures. Geneticists rely on two database search programs, BLAST and FASTA. Only about half of human genes show a significant BLAST or FASTA database similarity, however, and promoters, noncoding RNA genes, and various other important signals often cannot be detected. We need additional sequence analysis methods if we are to leverage the public and private investment in genomes. My work therefore involves making significant technological contributions to the way that DNA sequences are analyzed.
Identifying Noncoding RNAs in Genome Sequences
According to the RNA world hypothesis, RNA catalysts and replicators preceded modern protein/DNA machines. This hypothesis arose from the discovery of catalytic RNAs and also from the fact that functional RNAs are used instead of protein enzymes in some ancient, highly conserved roles in modern organisms. Some proponents of the RNA world hypothesis view extant functional RNAs as "molecular fossils" of the RNA world. How many genes encode functional RNA rather than protein? What are their functions? How many are evolutionarily ancient? The answers to these questions promise to shed some light on the origin of life.
Because noncoding RNA (ncRNA) genes tend to be small and are inherently immune to frameshift or nonsense mutations, they are hard to find by classical mutational genetic screens. They are also difficult to recognize in genomes because they do not have open reading frames and thus cannot be discovered by gene-finding programs that look for protein-coding regions. Their sequences also tend to evolve rapidly, conserving their structure more than their primary sequence, which makes them difficult to discover by standard database searches.
The availability of genome sequence data suggested that it might be possible to identify novel ncRNA genes systematically by computational screens, but only if new similarity search and gene identification algorithms could be developed that were capable of dealing with RNA's secondary structure features. I developed such algorithms, based on Bayesian probabilistic models called "stochastic context-free grammars."
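To give a feel for what a stochastic context-free grammar does with RNA, here is a minimal sketch and not the lab's software. It assumes a deliberately tiny grammar (S -> a S | a S b S | end) in which "a S b S" emits a base pair. The rule probabilities, the pair-emission values, and the function name `cyk_fold` are illustrative assumptions rather than trained parameters. The CYK recursion finds the most probable parse, and that parse is read off as a dot-bracket secondary structure.

```python
import math

# Assumed rule probabilities for S -> a S | a S b S | end (they sum to 1).
LOG_UNPAIRED, LOG_PAIR, LOG_END = math.log(0.3), math.log(0.4), math.log(0.3)
LOG_SINGLE = math.log(0.25)  # uniform emission for an unpaired base
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def pair_logp(a, b):
    # 6 canonical pairs share 0.9 of the mass, the 10 others share 0.1.
    return math.log(0.15 if a + b in CANONICAL else 0.01)

def cyk_fold(seq):
    """Return (dot-bracket structure, log-probability) of the best parse."""
    n = len(seq)
    # score[i][j]: best log-probability of deriving seq[i:j] from S.
    score = [[LOG_END if i == j else float("-inf") for j in range(n + 1)]
             for i in range(n + 1)]
    choice = [[None] * (n + 1) for _ in range(n + 1)]
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            # Rule S -> a S: leftmost base is unpaired.
            best, arg = LOG_UNPAIRED + LOG_SINGLE + score[i + 1][j], -1
            # Rule S -> a S b S: leftmost base pairs with position k.
            for k in range(i + 1, j):
                cand = (LOG_PAIR + pair_logp(seq[i], seq[k])
                        + score[i + 1][k] + score[k + 1][j])
                if cand > best:
                    best, arg = cand, k
            score[i][j], choice[i][j] = best, arg
    # Trace back the chosen rules to recover the structure.
    structure, todo = ["."] * n, [(0, n)]
    while todo:
        i, j = todo.pop()
        if i == j:
            continue
        k = choice[i][j]
        if k == -1:
            todo.append((i + 1, j))
        else:
            structure[i], structure[k] = "(", ")"
            todo.extend([(i + 1, k), (k + 1, j)])
    return "".join(structure), score[0][n]

structure, logp = cyk_fold("GGGGAAAACCCC")
print(structure)
```

A similarity search would replace this single generic grammar with one whose states and probabilities are tied to the columns of a known RNA family alignment. The sketch shows only the dynamic programming that both share.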
We have used these algorithmic approaches to develop a program that identifies ncRNA genes by taking advantage of comparative genome sequence analysis—that is, by comparing the DNA sequences of related organisms such as human and mouse. The pattern of mutations we observe in a human sequence compared to the related mouse sequence tells us something about the function of the sequence. Basically, we construct three statistical models describing the pattern of mutation we expect to see in RNA genes, protein genes, and other conserved sequences, and we test each conserved genomic region for the model it seems to fit best. Our first large-scale test of this approach was done in the small genome of the bacterium Escherichia coli, where our program has predicted a few hundred new RNA genes. We have confirmed some of these genes experimentally, and we are now also conducting computational screens for new ncRNA genes in other organisms, including human, nematode, and yeast.
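The three-model comparison can be illustrated with a toy sketch. This is not the program described above, and every number in it is an assumed, untrained parameter. The "other" model treats aligned columns independently. The "coding" model tolerates mismatches at every third position and averages over the three reading frames. The "RNA" model rewards column pairs that remain base-paired in both sequences. It takes the secondary structure as a given dot-bracket string, which is a simplification: a real screen would not know the structure in advance. Sequences are written in the RNA alphabet (ACGU), and all function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGU"
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def column_logp(a, b, p_same):
    """Log joint probability of one aligned column with identity rate p_same."""
    p = p_same if a == b else (1.0 - p_same) / 3.0
    return math.log(0.25 * p)

def loglik_other(x, y, p_same=0.8):
    # Position-independent conservation.
    return sum(column_logp(a, b, p_same) for a, b in zip(x, y))

def loglik_coding(x, y, p_strict=0.95, p_wobble=0.5):
    # Third codon positions mutate freely; average over the unknown frame.
    frame_ll = []
    for frame in range(3):
        ll = 0.0
        for i, (a, b) in enumerate(zip(x, y)):
            third = (i - frame) % 3 == 2
            ll += column_logp(a, b, p_wobble if third else p_strict)
        frame_ll.append(ll)
    top = max(frame_ll)
    return top + math.log(sum(math.exp(v - top) for v in frame_ll) / 3.0)

def _paired_table():
    # Assumed weights: strongly favor columns paired in both sequences.
    weight = {2: 20.0, 1: 1.0, 0: 0.5}
    raw = {}
    for xi, xj, yi, yj in product(BASES, repeat=4):
        n = ((xi, xj) in CANONICAL) + ((yi, yj) in CANONICAL)
        raw[(xi, xj, yi, yj)] = weight[n]
    total = sum(raw.values())
    return {k: math.log(v / total) for k, v in raw.items()}

PAIRED_LOGP = _paired_table()

def loglik_rna(x, y, structure, p_same=0.8):
    # Paired columns are scored jointly, unpaired columns independently.
    ll, stack = 0.0, []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            i = stack.pop()
            ll += PAIRED_LOGP[(x[i], x[j], y[i], y[j])]
        else:
            ll += column_logp(x[j], y[j], p_same)
    return ll

def classify(x, y, structure):
    """Return the best-fitting model label and all three log-likelihoods."""
    scores = {
        "RNA": loglik_rna(x, y, structure),
        "COD": loglik_coding(x, y),
        "OTH": loglik_other(x, y),
    }
    return max(scores, key=scores.get), scores

# A hairpin whose stem differs at every position yet still base-pairs.
label, scores = classify("GGGGAAAACCCC", "CCCCAAAAGGGG", "((((....))))")
print(label, scores)
```

The key signal is the compensatory change: a mutation on one side of a stem matched by a mutation on the other side looks like heavy divergence to the other two models but is cheap under the RNA model.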
Our results, and results from several other labs, seem to indicate that ncRNA genes are more prevalent than even we thought. For those new genes where we have some indication of their function, most appear to be functioning as highly adapted regulatory molecules, which is not consistent with the idea that they are ancient molecular fossils of the RNA world. It is beginning to appear that ncRNAs could be a large class of genes that have been overlooked because it is so difficult to identify them.
Hidden Markov Models and the Pfam Database
We are also developing software to support high-throughput bioinformatics analysis. One of our software packages, HMMER, allows researchers to search protein sequence databases with a statistical description of the conserved sequences in a given protein domain query. HMMER, which is now installed at more than a thousand sites worldwide, made it possible to develop a large database of multiple sequence alignments of protein domain families called Pfam (for "protein families"). I continue to help develop and maintain Pfam, which currently contains multiple alignments and models for more than 7,500 different protein domain families.
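The idea of searching with a "statistical description" of a domain can be sketched with a much simpler relative of a profile hidden Markov model: a gapless position-specific scoring matrix. This is an illustrative simplification and not HMMER's algorithm or interface. It has no insert or delete states, it assumes a uniform background distribution, and the names `build_profile` and `best_hit` are hypothetical.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Turn a gapless alignment into per-column log-odds scores (bits)."""
    background = 1.0 / len(AMINO)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO
        })
    return profile

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (score, start) of the best window."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        best = max(best, (score, start))
    return best

# Three aligned examples of a made-up four-residue motif.
profile = build_profile(["HCGH", "HCAH", "HCGH"])
score, start = best_hit(profile, "MKTHCGHLLV")
print(start, round(score, 2))
```

Because each column has its own scores, strongly conserved positions (H and C here) contribute more to a hit than variable ones, and that position-specific weighting is what a pairwise comparison against a single sequence cannot capture.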
Probabilistic Models for Biological Sequence Analysis
Our methods rely on mathematical models of biological sequences. These models are rooted in the fields of Bayesian probabilistic modeling theory and in formal grammar theory from computational linguistics. We use Bayesian models as a framework for implementing searches that may need to take into account RNA secondary structure, position-specific probabilities of conservation or insertion/deletion, and phylogenetic relationship.
This research was supported by funding from the National Human Genome Research Institute, the W.M. Keck Foundation, and Alvin Goldfarb.
Last updated: January 31, 2007