Genome sequencing is a fundamental tool for many areas of biological research, ranging from protein engineering to diagnostics for genetic diseases. However, DNA sequence alone is as uninformative as encrypted code. To extract useful information from genomic DNA, we must "decode" the sequence.
As a computational biologist, I develop software applications for interpreting genome sequences. I use probabilistic models that originated in the field of computational linguistics, such as hidden Markov models and stochastic context-free grammars. Much of my work has used these methods to identify and characterize structural RNA genes.
RNA genes are genes that do not make messenger RNAs that code for proteins, but instead make RNAs that function directly as RNA. Since the beginnings of molecular biology, it has been known that RNA genes produce ribosomal RNA (rRNA) and transfer RNA (tRNA), fundamental components of the protein translation machinery. Nonetheless, these were thought to be exceptions to the "central dogma" that RNA's main role in cells is to encode functional proteins. The view of RNA's role broadened in the 1980s when the first catalytic RNAs were discovered, including group I introns and RNase P RNA, showing that RNAs could not only play structural roles but also act as enzymes, which had previously been thought to be exclusively protein-based.
By the 1990s, many kinds of functional RNAs had been discovered, including two large families of small nucleolar RNAs (snoRNAs) that direct specific modifications of rRNA. It was surprising to continue to identify previously unsuspected RNA genes, and even whole families of them. The discovery of the two large snoRNA families, combined with the advent of genome-sequencing technology in the late 1990s, fueled the development of systematic methods for searching for more potential RNA genes. In 2001, an entirely new gene family was discovered, the microRNAs (miRNAs). This appears to be the largest of all RNA gene families, with hundreds or thousands of genes in a typical animal genome. Until this discovery, only two miRNA genes had been known, and they were thought to be oddities of the nematode worm, in which they had been found. Now it is clear that miRNAs are a key regulatory gene family in many eukaryotes.
How had so many RNA genes remained undiscovered, despite having crucial roles in cellular function? The short answer is that we were not looking for them. Gene discovery methods were biased toward "typical" genes that encode proteins. Experimental and computational methods both needed to be adapted to search specifically for functional RNA genes.
What can we look for in a DNA sequence to discover an RNA gene? A prominent signal in many (but not all) known RNA genes is that the RNA molecule folds and base-pairs to itself, forming small helical structures similar to those of the DNA double helix. Computational modeling of this "secondary structure" requires treating the DNA sequence not just as a linear string of letters, but at least as a two-dimensional pattern of base-pairing interactions. Probabilistic models called "stochastic context-free grammars" (SCFGs), which are ideally suited for this problem, were introduced into computational biology from computational linguistics in the 1990s for RNA analysis.
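To make the idea of nested base-pairing interactions concrete, here is a minimal sketch in Python. It is not an SCFG and not the gene-finding program described in this article; it is the much simpler Nussinov-style dynamic program, which counts the maximum number of nested base pairs a sequence can form. SCFG parsing algorithms fill a similar table over subsequences, but score structures with probabilities rather than a pair count. The function name `max_base_pairs`, the `min_loop` parameter, and the choice to allow G-U wobble pairs are assumptions made for illustration.

```python
# Illustrative only: Nussinov-style base-pair maximization, a simplified,
# non-probabilistic relative of SCFG-based RNA structure algorithms.

# Allowed pairs (Watson-Crick plus G-U wobble); an assumption of this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in an RNA sequence.

    min_loop is the minimum number of unpaired residues required between
    two paired positions (a hypothetical default of 3 is used here).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the answer for the subsequence seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1 and 2: position i or position j is left unpaired.
            score = max(best[i + 1][j], best[i][j - 1])
            # Case 3: i pairs with j, enclosing the structure on i+1..j-1.
            if (seq[i], seq[j]) in PAIRS:
                score = max(score, best[i + 1][j - 1] + 1)
            # Case 4: two independent substructures side by side.
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    # A short hairpin: three G-C pairs close a four-residue loop.
    print(max_base_pairs("GGGAAAUCCC"))
```

Each table entry depends on entries for shorter subsequences nested inside it, which is the kind of long-range, nested dependency that a purely linear, left-to-right sequence model cannot represent.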
Using SCFGs, I developed one of the first RNA gene-finding programs. This program has been used to search for novel RNA genes in several genomes, including the bacteria Escherichia coli and Sinorhizobium meliloti, the archaeon Pyrococcus furiosus, the yeast Saccharomyces cerevisiae, and the nematode worm Caenorhabditis elegans. These are all relatively small genomes. The main drawback of the program (and other related computational tools that have since appeared) is that it predicts too many false positives, and in large genomes (including the human genome) it becomes difficult to decide which predictions are real functional RNAs and which are false predictions. Many people believe that high-throughput experimental methods for RNA gene discovery have significant false-positive problems as well. A great current question and controversy in the field is how many RNA genes there are in the human genome. Some published predictions are as high as tens of thousands of RNAs. If true, this would rival or exceed the number of protein-coding genes thought to be present in humans. It is too soon to draw conclusions. I plan to approach the problem by implementing better computational models and reducing the number of erroneously predicted RNAs.
Genomes also give us insight into the evolution of life. Comparing genomes of different species can allow us to infer a history of the evolutionary changes that occurred since those species diverged from their last common ancestor. Using phylogenetic inference methods and molecular sequence data, we now appreciate the broad outlines of a tree of life relating all species, including eukaryotes (protists, plants, fungi, animals), bacteria, and archaea. However, many details of evolutionary relationships remain undetermined.
Phylogenetic inference methods have been developed over many years. Probabilistic models of evolution are generally thought to give the most powerful and reliable estimators of phylogenetic relationships. A fundamental technical problem of all these methods is that they are best at modeling residue substitutions, but generally treat other evolutionary events such as insertions and deletions with less rigor, if at all. But insertion and deletion events are common; genes and genomes show great variability in sequence lengths.
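As a small illustration of a substitution-only probabilistic model, the sketch below uses the Jukes-Cantor (JC69) model, chosen here only because it is the simplest standard example; it is not one of the models developed in the work described in this article. Under JC69, if d is the expected number of substitutions per site, a residue stays the same with probability 1/4 + (3/4)exp(-4d/3) and changes to any one specific other residue with probability 1/4 - (1/4)exp(-4d/3). The function names are hypothetical. Note how the distance estimate handles gaps: alignment columns containing "-" are simply discarded, which is one common way that insertions and deletions end up being ignored.

```python
# Illustrative only: the Jukes-Cantor (JC69) substitution model.
import math


def jc69_transition_prob(same, d):
    """Probability of observing the same residue (same=True) or one specific
    different residue (same=False) after d expected substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def jc69_distance(seq_a, seq_b):
    """Estimate the JC69 distance between two aligned DNA sequences.

    Columns where either sequence has a gap character '-' are dropped,
    so insertion and deletion events contribute nothing to the estimate.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    # Observed fraction of ungapped columns that differ.
    p = sum(a != b for a, b in columns) / len(columns)
    if p >= 0.75:
        return math.inf  # saturated: the estimate is undefined beyond 3/4
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # One difference in eight ungapped columns.
    print(jc69_distance("ACGTACGT", "ACGTACGA"))
    # The gapped column is ignored, so these look identical to the model.
    print(jc69_distance("AC-T", "ACGT"))
```

In the second call, the deletion is invisible to the model: the two sequences are scored as if no evolutionary event had separated them.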
I am currently working on probabilistic evolutionary models that include insertions and deletions, as well as substitutions. An important property of my models is that they give inference algorithms that require little more computational time than the current methods that analyze only substitutions.
In the future, I will continue to use probabilistic modeling to attack problems in genome sequence analysis. At Janelia, I am in the fortunate position of being able to apply computational methods to neurobiology.
Some of the work described here has been supported by funding from the National Institutes of Health.