The central problem facing modern human genetics is to make sense of the vast sea of human genetic variation. Which of the ~10 million common SNPs and thousands of genomic regions with deletions or duplications contribute to common diseases and other phenotypes of interest? What role do rare variants play in the etiology of disease? How do we determine the functional variants that underlie observed disease associations? When we know a person's genome sequence, can we predict the diseases for which that person is at risk?
My group focuses on understanding the nature of human genetic variation. We have made contributions in a number of areas, including (1) understanding the history and structure of human populations, (2) developing methods for identifying genetic variants that contribute to common diseases, and (3) quantifying the distribution and extent of genetic variation and human adaptation. Recently we have also become interested in the population genetics of gene expression, and we are now focusing a substantial part of our effort in this area.
One main line of work has been developing and applying methods that use genetic data to learn about the structure of human populations. My colleagues and I have developed a statistical clustering approach, implemented in a free program called structure, that uses genetic information to identify groups of related individuals. The method uses a Bayesian clustering model that simultaneously detects the presence of distinct population clusters and estimates the membership of each individual in each cluster. The approach has proved valuable for understanding how genetic variation is distributed across the range of a species. For example, we have used this method to show that at the highest level, human populations correspond closely to traditional continental regions, with barriers to gene flow, such as the Himalayas, creating boundaries between the major human populations. Our structure software is now widely used in a range of fields, including human genetics and anthropology, studies of pathogen evolution, forensics, molecular ecology, and conservation genetics.
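To make the clustering idea concrete, here is a minimal sketch of a Bayesian clustering sampler of this general kind. It is not the structure program itself. It assumes a much simpler model: no admixture, biallelic loci coded as 0/1/2 allele counts, Hardy-Weinberg proportions within each cluster, a uniform Beta(1, 1) prior on allele frequencies, and a number of clusters k fixed in advance. The function name and its parameters are hypothetical.

```python
import numpy as np

def gibbs_cluster(genotypes, k, n_iter=200, seed=0):
    """Toy Gibbs sampler for a no-admixture clustering model (illustrative only).

    genotypes: (n_individuals, n_loci) array of 0/1/2 counts of one allele.
    Returns the last sampled cluster label for each individual and the
    matching (n_individuals, k) matrix of membership probabilities.
    """
    rng = np.random.default_rng(seed)
    n, n_loci = genotypes.shape
    z = rng.integers(0, k, size=n)  # random initial cluster labels
    probs = np.full((n, k), 1.0 / k)
    for _ in range(n_iter):
        # Step 1: sample allele frequencies for each cluster given the labels.
        freqs = np.empty((k, n_loci))
        for c in range(k):
            members = genotypes[z == c]
            counted = members.sum(axis=0)          # copies of the counted allele
            other = 2 * len(members) - counted     # copies of the other allele
            freqs[c] = rng.beta(1 + counted, 1 + other)
        freqs = np.clip(freqs, 1e-9, 1 - 1e-9)     # keep the logs finite
        # Step 2: sample labels given the frequencies (uniform prior on labels).
        log_lik = genotypes @ np.log(freqs).T + (2 - genotypes) @ np.log1p(-freqs).T
        log_lik -= log_lik.max(axis=1, keepdims=True)
        probs = np.exp(log_lik)
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(k, p=p) for p in probs])
    return z, probs
```

Cluster labels are arbitrary, so results from different runs should be compared up to a relabeling of the clusters. A real analysis would also average over many sampled iterations rather than report only the last one.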
An important application of this type of method arises in the context of mapping disease genes. One potentially powerful study design, "case-control association," fell out of favor in the 1990s because of concern that unrecognized population structure could lead to false positives. Our work helped to show that these problems can be greatly reduced by using random unlinked marker loci to detect and correct for cryptic population structure. The case-control design is now the most widely used approach in the field, and virtually all such studies test for population structure using the genetic data (now most often with principal component approaches).
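The correction described above can be illustrated with a short sketch of the principal component variant. It assumes genotypes coded as 0/1/2 at unlinked markers and, purely for brevity, fits an ordinary least-squares model to a 0/1 case status, where a real study would more typically use logistic regression. The function names top_pcs and snp_effect are hypothetical.

```python
import numpy as np

def top_pcs(markers, n_pcs=2):
    """Leading principal components of an (n_individuals, n_markers) 0/1/2 matrix."""
    x = markers - markers.mean(axis=0)
    sd = x.std(axis=0)
    keep = sd > 0                      # drop markers with no variation
    x = x[:, keep] / sd[keep]          # standardize each marker
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]    # individual scores on the top components

def snp_effect(phenotype, snp, covariates=None):
    """Least-squares slope of phenotype on a test SNP, optionally adjusted."""
    n = len(phenotype)
    cols = [np.ones(n), snp]
    if covariates is not None:
        cols.extend(covariates.T)      # one column per covariate
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coef[1]                     # coefficient on the test SNP
```

When two subpopulations differ in both disease frequency and allele frequency, snp_effect(phenotype, snp) can be far from zero even for a SNP with no causal role, whereas snp_effect(phenotype, snp, top_pcs(markers)) should shrink toward zero because the leading components absorb the ancestry difference.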
A third major area of study has been the nature of human genetic variation. This includes work describing the amount of linkage disequilibrium in human populations, as well as one of the early papers showing that large deletions are surprisingly widespread in the human genome. We have been particularly interested in the extent of recent human adaptation and the identification of the genomic regions (and ultimately the variants) that have been targets of selection. We have shown that even during the last ~10,000 years, humans have been adapting to a variety of environmental pressures. Our latest work shows, however, that with the exception of a few key genes, most human adaptation proceeds relatively slowly and human populations are generally not microadapted to very local conditions.
Finally, we have recently become interested in the population genetics of gene expression variation. While there is already quite accurate annotation of human protein-coding genes, knowledge of which parts of the genome encode regulatory sequences (and what they do) lags far behind. Other groups previously reported that a sizable fraction of genes harbor quantitative trait loci (QTL) that affect gene expression levels. We have recently shown that these types of data (genome-wide genotype and expression data on many individuals) can be used to provide a relatively unbiased profile of the types of locations within genes that can affect gene expression levels. In particular, we find there are two major peaks of QTL density: one is centered symmetrically around the transcription start site and the second is immediately upstream of the transcription end site. In the longer term, the expression QTL-mapping approach is a powerful tool for linking the fields of gene regulation and population genetics.
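As a rough illustration of the expression QTL idea, the sketch below scans the SNPs near one gene and regresses expression on genotype at each SNP separately. It is a simplified stand-in and not the analysis used in our work: it assumes 0/1/2 genotype coding, a single gene, and simple linear regression with no covariates or multiple-testing correction. The function name cis_eqtl_scan is hypothetical.

```python
import numpy as np

def cis_eqtl_scan(expression, genotypes):
    """Per-SNP simple linear regression of expression on genotype.

    expression: (n_individuals,) expression levels for one gene.
    genotypes:  (n_individuals, n_snps) 0/1/2 allele counts.
    Returns per-SNP slopes and the fraction of expression variance explained.
    """
    y = expression - expression.mean()
    x = genotypes - genotypes.mean(axis=0)
    sxx = (x ** 2).sum(axis=0)
    sxy = x.T @ y
    syy = (y ** 2).sum()
    safe_sxx = np.where(sxx > 0, sxx, 1.0)   # avoid dividing by zero
    safe_syy = syy if syy > 0 else 1.0
    slopes = np.where(sxx > 0, sxy / safe_sxx, 0.0)
    r2 = np.where(sxx > 0, sxy ** 2 / (safe_sxx * safe_syy), 0.0)
    return slopes, r2
```

Repeating such a scan across many genes and tabulating the position of each strongest SNP relative to the transcription start and end sites is one way a profile of QTL density along genes could be built up.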