Our goal is to understand biological evolution at scales ranging from individual molecules to whole ecosystems. We combine techniques from fields as diverse as computer science and molecular biology to study the evolution, structure, and function of the human microbiome (the microbes that inhabit each of our bodies) and, at a more fundamental level, the evolution of biochemical functions in random-sequence pools of RNA molecules.
Community Composition and the Human Microbiome
We are developing new methods to understand why particular groups of organisms live in particular environments, and how these organisms interact with one another to form functional assemblages. For example, communities in the soil might be driven by nitrate, phosphate, average temperature, moisture content, or pH (our studies with Noah Fierer [University of Colorado at Boulder] suggest that pH is by far the most important of these factors). The Human Microbiome Project, of which my lab is a part, seeks to bring these studies closer to home and to understand the vast number of microbial symbionts that live on and inside each of our bodies. These microbes may outnumber our own cells by as much as 10 to 1, contribute many metabolic functions we would otherwise lack, and likely pervade many aspects of our lives. We are all more than 99 percent identical at the level of our human DNA, but we may be 80–90 percent different from one another in terms of the microbes that inhabit us. It therefore makes sense to look for factors associated with health and disease where the variation actually is.
To explore these microbial communities, it helps to take an explicitly evolutionary perspective that exploits similarities and differences among different groups of organisms. We recently developed UniFrac, a distance metric that uses a phylogenetic tree to measure the biological distance between each pair of environments represented in the tree. We can then use clustering methods, such as hierarchical clustering, and ordination methods, such as principal coordinates analysis, to identify which environments are more similar or more different and to correlate these differences with physical and biological properties of the environment. For example, we have used UniFrac to discover that salinity is the main driving factor across a broad range of distinct physical habitats, that mammalian gut communities cluster primarily by diet, and that the gut differs far more from the various physical habitats than those habitats differ from one another. Even within one person, the difference between the gut and mouth communities can be larger than the difference between the communities living in a hot spring and on an ice cap. UniFrac is already having a wide impact in a range of environmental and medical applications.
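To make the idea of a tree-based distance concrete, here is a minimal Python sketch of the unweighted form of such a metric: the fraction of branch length that leads to sequences from only one of two environments. The tree encoding (`parent` and `length` dictionaries), the function name, and the toy data are all assumptions made for illustration; this is not the UniFrac software itself.

```python
from collections import defaultdict


def unweighted_unifrac(parent, length, leaf_envs, env_a, env_b):
    """Fraction of branch length unique to one of two environments.

    parent:    node -> parent node (the root maps to None)
    length:    node -> length of the branch above that node
    leaf_envs: leaf -> set of environments in which the sequence was seen
    """
    if env_a == env_b:
        return 0.0
    # Propagate environment labels from every leaf up to the root.
    envs_below = defaultdict(set)
    for leaf, envs in leaf_envs.items():
        present = envs & {env_a, env_b}
        node = leaf
        while node is not None and present:
            envs_below[node] |= present
            node = parent[node]
    unique = shared = 0.0
    for node, present in envs_below.items():
        if parent[node] is None:
            continue  # the root has no branch above it
        if len(present) == 1:
            unique += length[node]   # branch leads to one environment only
        else:
            shared += length[node]   # branch leads to both environments
    total = unique + shared
    return unique / total if total else 0.0


# Hypothetical four-leaf tree: root -> (X -> a, b) and (Y -> c, d).
parent = {"root": None, "X": "root", "Y": "root",
          "a": "X", "b": "X", "c": "Y", "d": "Y"}
length = {"X": 1.0, "Y": 1.0, "a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
leaf_envs = {"a": {"gut"}, "b": {"gut", "mouth"},
             "c": {"mouth"}, "d": {"soil"}}

print(round(unweighted_unifrac(parent, length, leaf_envs, "gut", "mouth"), 3))
# prints 0.571
```

Computing this value for every pair of environments gives a distance matrix that can be passed to hierarchical clustering or principal coordinates analysis.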
One key challenge with published microbial community sequence data is that information about the abundance of each species is often not reported. We therefore need high-throughput techniques to collect sequence data in a consistent manner across environments. A key advance we have made in this regard is barcoded pyrosequencing, especially the use of formal error-correcting codes, which allows us to use pyrosequencing to study hundreds of microbial communities simultaneously. Barcoded pyrosequencing opens up whole new vistas of unexplored microbial diversity. For example, we have used this technology to study water quality and samples from the lungs of cystic fibrosis patients, to show that, if you're typical of our study population, your left and right hands probably share only about 18 percent of their species, and to show that systematic shifts in the gut community occur with obesity. Remarkably, radically different species assemblages can maintain a core at the level of gene functions, paralleling trends in macroecology where, for example, grasslands on different continents may share none of their species but may look extremely similar in physical and chemical conditions when compared to, say, rain forests.
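The role of error correction in barcoding can be illustrated with a short Python sketch. It assumes a small, hypothetical set of barcodes whose pairwise Hamming distance is at least 3, so that any single substitution still points to exactly one barcode, and it uses simple nearest-barcode decoding rather than the specific code construction used in our experiments. The barcode sequences, sample names, and function names are invented for the example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(barcodes):
    """Smallest pairwise Hamming distance in a barcode set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_sample(read, barcodes, max_errors=1):
    """Return (sample, remaining read), or (None, read) if undecodable.

    barcodes: dict mapping barcode sequence -> sample name.
    """
    k = len(next(iter(barcodes)))
    if len(read) < k:
        return None, read
    prefix = read[:k]
    hits = [sample for bc, sample in barcodes.items()
            if hamming(prefix, bc) <= max_errors]
    if len(hits) == 1:
        return hits[0], read[k:]
    return None, read  # too many errors to assign the read safely


# Hypothetical barcode set with minimum pairwise distance 3.
barcodes = {"ACGT": "gut", "AGTC": "hand", "CATG": "lung"}
assert min_distance(barcodes) >= 3

print(assign_sample("ACGTTTGACC", barcodes))  # exact barcode match
print(assign_sample("ACGATTGACC", barcodes))  # one substitution, corrected
print(assign_sample("TTGTTTGACC", barcodes))  # two substitutions, rejected
```

Because the barcodes are at least three substitutions apart, a read with one error cannot sit within one substitution of two different barcodes, which is why a single match is enough to assign the sample.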
We are trying to extend these techniques to personalized medicine, with applications to conditions including obesity, malnutrition, and Crohn's disease. Our goal is to develop a predictive model that will allow us to test the effects of different treatments on an individual patient's microbiota. For example, in our study populations in developing nations for the malnutrition project, a particular child might have a treatable pathogen that blocks nutrient uptake, might lack a normal gut symbiont that we could supply, or might simply have a mismatch between what he or she can metabolize and what he or she is eating. Our dream is to use the decreasing cost of DNA sequencing to bring the benefits of personalized medicine to slums and refugee camps. (Collaborators on these projects include Jeffrey Gordon, Ruth Ley, Frederic Bushman, Noah Fierer, Scott Kelley, Norman Pace, Eric Triplett, Allan Konopka, Henry Tufo, Gary Andersen, and Todd DeSantis, among others.)
Although bacteria are very small, they are still far too complex to allow convenient understanding of the most fundamental processes of evolution. At the far simpler level of individual molecules, an experimental technique called in vitro selection allows RNA molecules that perform particular biological functions, such as catalyzing a reaction or binding a target, to be isolated from large pools of random RNA sequences. Typically, these pools are designed to contain the four nucleotides in equal proportions. However, it may be that functional RNA molecules are most likely to be found in particular regions of sequence space that share the compositional biases often found in biological sequences.
We are therefore comparing RNA molecules isolated from in vitro selection to biological RNA molecules to test whether there are general rules that govern the nucleotide composition of specific RNA structural features. Several researchers have shown that biological RNAs are specifically biased toward purines (the bases A and G, which contain a double ring, as opposed to the bases C and U, which contain a single ring). We are testing whether functional RNA molecules of defined overall composition differ statistically from random molecules of the same composition, and whether there are rules that govern how many of each kind of base in a random sequence end up in different structural categories, such as stems, loops, bulges, and junctions. These patterns are heavily influenced by three-dimensional structural features. We expect that these rules will help us improve RNA secondary structure prediction. We also expect that we will find general rules that influence the assembly of particular RNA architectures.
Related to this project, we are also testing whether the information contained in minimal functional RNA motifs is sufficient, as well as necessary, for function. In vitro selection experiments typically isolate short, degenerate sequences that are necessary for function from many different random-sequence backgrounds. Continuing work in Michael Yarus's lab (University of Colorado at Boulder) has shown that the minimal motif that performs a particular task, such as binding or catalysis, can be found by "squeezing" the random region into shorter and shorter lengths. If these sequences and their specific secondary structure configuration are sufficient for activity, we should be able to obtain functional sequences by embedding them in longer, random sequences. We are determining whether this is the case, or whether additional identity elements are needed. Because we can accurately predict how many random sequences are required to obtain a specified sequence and secondary structure motif, this work is crucial for estimating the information required to perform different catalytic or binding functions. (Collaborators on this project include Michael Yarus, Manuel Lladser, Hans De Sterck, Sandra Smit, and Jana Chocholoušová.)
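As a back-of-the-envelope illustration of how such a prediction can work, the Python sketch below estimates how many random sequences are needed before a fully specified, contiguous motif is likely to appear at least once. It is a deliberate simplification: it assumes independent positions, ignores secondary structure, overlapping occurrences, and motifs split into separate modules, and uses a Poisson approximation. The function names and the example motif are hypothetical.

```python
import math


def motif_probability(motif, base_freqs=None):
    """Probability that a random sequence matches the motif at one position."""
    if base_freqs is None:
        base_freqs = {base: 0.25 for base in "ACGU"}  # unbiased pool
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p


def expected_occurrences(motif, seq_len, base_freqs=None):
    """Expected number of motif occurrences in one random sequence."""
    positions = seq_len - len(motif) + 1
    if positions <= 0:
        return 0.0
    return positions * motif_probability(motif, base_freqs)


def pool_size_needed(motif, seq_len, confidence=0.5, base_freqs=None):
    """Sequences needed to see the motif at least once with given confidence.

    Poisson approximation: P(no hit in N sequences) = exp(-N * lam).
    """
    lam = expected_occurrences(motif, seq_len, base_freqs)
    if lam == 0.0:
        return math.inf
    return math.ceil(-math.log(1.0 - confidence) / lam)


# A 4-nucleotide motif in random 7-mers has 4 possible start positions.
print(pool_size_needed("GGAA", seq_len=7, confidence=0.5))
# prints 45
```

Passing a biased `base_freqs` dictionary (for example, a purine-rich pool) shows how nucleotide composition changes the pool size required for the same motif.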