Millions of species of living organisms on the planet possess billions of different proteins. This enormous diversity has evolved from a limited number of ancestral proteins, probably about a thousand. An expansion of more than 6 orders of magnitude in protein numbers has produced rich material for studying the laws of evolution.
From a pragmatic perspective, evolutionary links between proteins offer shortcuts to gain knowledge about homologs from a few experimentally characterized representatives. A homology link detected from sequences is the most powerful source of structure prediction, often leads to functional insights, and can guide experimental design.
From a theoretical perspective, we would like to understand how biological diversity is generated. Amazingly, the same themes and motifs are recurrently used, elaborately modified, and combined in evolution to produce functional entities. Our goal is to uncover these prevalent mechanisms in protein evolution.
Recent advances in obtaining sequence and structure information (~60,000,000 nonidentical sequences and ~110,000 spatial structures) enable productive computational analysis. A grand step toward comprehending the protein universe would be the classification of sequence-structure data into an evolutionarily relevant hierarchical system. When two proteins display clear similarity, the task is straightforward. When similarity is low, however, discrimination between evolutionarily meaningful and spurious relationships becomes highly nontrivial. Our group finds and interprets distant links between proteins. Because no available method deals well with this difficult problem, we develop new mathematical approaches to explore protein sequence-structure data. Because no single narrow approach is able to find remote homologs, we combine sequence and structure analysis with evolutionary considerations to facilitate discoveries of biological significance. The following highlights illustrate the scope of our projects.
Evolutionary Classification of Protein Domains
For years, we have been developing computational methods for the analysis of distantly related proteins. This work has culminated in an evolutionary classification of all domains (ECOD) with experimentally determined spatial structures. ECOD is available to scientists through a website. Currently, more than 300,000 protein chains from all 110,000 available PDB (protein data bank) structures are partitioned into 400,000 domains that are classified into 11,000 protein families grouped into 3,000 homologous superfamilies. ECOD differs from similar resources (e.g., SCOP [structural classification of proteins] and CATH [class, architecture, topology, and homology]) in four main respects. First, it covers very distant evolutionary connections between domains. This information is not available elsewhere and is the focus of our research. Second, ECOD focuses on evolution rather than three-dimensional (3D) structure and thus lacks a fold level. Third, ECOD is updated every week, and all proteins with newly determined 3D structures are classified. Finally, ECOD is better integrated with existing sequence classification databases; e.g., ECOD families mirror families in the Pfam (Protein Families) database when possible. Keeping up with weekly updates is nontrivial and is achieved by combining robust software with a manual curation pipeline that receives the domains that fail confident automatic assignment. Every week, a dozen domains are inspected by experts in the lab to render classification decisions. This pipeline is sustainable and beneficial for other research projects.
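The split between automatic assignment and manual curation described above can be pictured as a simple triage step. The sketch below is only an illustration under stated assumptions: the `Assignment` record, the `score` field, the identifiers in the usage note, and the 0.9 threshold are all hypothetical and are not taken from the ECOD software.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    domain_id: str   # hypothetical identifier of a newly released domain
    family: str      # family proposed by the automatic step
    score: float     # hypothetical confidence in [0, 1]


def triage(assignments, threshold=0.9):
    """Split proposals into confident automatic calls and a manual queue.

    The threshold is an assumed parameter, not a value used by ECOD.
    """
    automatic, manual = [], []
    for a in assignments:
        if a.score >= threshold:
            automatic.append(a)
        else:
            manual.append(a)
    return automatic, manual
```

For example, `triage([Assignment("d1", "famA", 0.97), Assignment("d2", "famB", 0.41)])` would accept the first proposal automatically and send the second to the curators.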
Toward Prediction of Phenotype From Genotype
Predicting phenotype from genotype represents the epitome of biology. For bacteria, this is a well-studied problem. For animals, it is largely unexplored territory. We can envision a computer program that, given a genome sequence, outputs a 3D model of the animal encoded by this genome. Although as simple to state as the protein-folding problem, this organismal 3D structure puzzle is far more challenging. Nevertheless, comparative genomics of suitable model organisms promises groundbreaking discoveries. We started with fungi. Stachybotrys molds produce diverse toxins that affect human health. To determine the genetic basis of toxin synthesis, we selected four strains producing different toxins and sequenced their complete genomes. We compared the genomes, found several strain-specific gene clusters, and proposed a unified biochemical model for Stachybotrys toxin production.
Moving on to animals, we sequenced the complete genome of the Eastern Tiger Swallowtail butterfly with our labor-efficient and cost-effective protocol. The cost per new genome falls below $4,000, making insect sequencing projects feasible even for college students. Comparative analysis suggested molecular bases of various phenotypic traits, including terpene production in the swallowtail-specific organ osmeterium. Four key circadian clock proteins are enriched in interspecies mutations and are likely responsible for the pupal diapause difference in swallowtail species. Next, in collaboration with the Magalhães group (University of Liverpool), we obtained and annotated the complete genome of the bowhead whale, which is the longest-living mammal. When we correlated variation in proteins with life span, we found a prominent cluster of longevity-linked positions in the kinase domain of anti-Müllerian hormone type-2 receptor, which inhibits ovarian follicles. This cluster is near a SNP (single-nucleotide polymorphism) associated with delayed human menopause and may function to regulate kinase activity in a life span–specific manner.
Algorithms to Search for Distantly Related Proteins
The BLAST family of programs revolutionized computational biology. Protein BLAST searches a database of sequences with a single query sequence. PSI-BLAST searches a database of sequences with a query alignment. Alignments reveal residue conservation patterns in a family and thus find more distant homologs. A few years back, our group developed COMPASS to search a database of alignments with an alignment. Taking into account conservation patterns in both query and hit, COMPASS can identify weak but homologous similarities. However, many remote evolutionary connections between proteins remain undetectable.
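To make the idea of comparing an alignment with an alignment concrete, here is a minimal sketch in which each alignment is reduced to a profile (one residue-frequency dictionary per column) and two profiles are compared by Smith-Waterman local alignment. This is not the COMPASS scoring scheme: the dot-product column score, the `baseline` offset, and the linear `gap` penalty are assumptions chosen only to keep the example short.

```python
def column_score(p, q, baseline=0.5):
    """Score two profile columns (dicts mapping residue -> frequency).

    Assumed scoring: dot product of frequencies minus a baseline, so that
    dissimilar columns score below zero.
    """
    return sum(freq * q.get(res, 0.0) for res, freq in p.items()) - baseline


def local_profile_score(query, hit, gap=1.0):
    """Best Smith-Waterman local alignment score between two profiles."""
    prev = [0.0] * (len(hit) + 1)
    best = 0.0
    for i in range(1, len(query) + 1):
        cur = [0.0] * (len(hit) + 1)
        for j in range(1, len(hit) + 1):
            cur[j] = max(
                0.0,                                                # start a new local alignment
                prev[j - 1] + column_score(query[i - 1], hit[j - 1]),  # align two columns
                prev[j] - gap,                                      # gap in the hit
                cur[j - 1] - gap,                                   # gap in the query
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Because every column carries the conservation pattern of a whole family, two profiles can score well even when no single pair of member sequences does, which is the intuition behind profile-profile search.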
To push the boundary of sequence-based homology inference further, we recently introduced several innovations that resulted in a dramatic improvement of the algorithm. Traditionally, sequence similarity search has relied on the negatives (i.e., unrelated sequences from which to construct random models). Exploring the other side, we demonstrated that the search can be boosted by considering the positives: i.e., known homology relationships in a database of sequence profiles. Similar strategies have been widely used by the most successful search engines, such as Google. Our new method, COMPADRE, assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query's and hit's known homologs. This approach increases our precision from 18 percent to 83 percent at half coverage of all database homologs. This improvement allows detection of a large fraction of new protein-structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities.
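The two ideas in this paragraph, scoring a query-hit pair through their known homologs and reporting precision at a fixed coverage, can be sketched as follows. This is an illustration under assumptions, not the published COMPADRE formula: averaging direct scores over the two neighborhoods is a stand-in for whatever combination the method actually uses, and the names `neighbor_score`, `precision_at_coverage`, `direct`, and `homologs` are hypothetical.

```python
def neighbor_score(query, hit, direct, homologs):
    """Average direct similarity over the query's and the hit's neighborhoods.

    direct:   function (a, b) -> similarity score between two database entries
    homologs: dict mapping an entry to the set of its known homologs
    """
    q_group = {query} | homologs.get(query, set())
    h_group = {hit} | homologs.get(hit, set())
    pairs = [(q, h) for q in q_group for h in h_group]
    return sum(direct(q, h) for q, h in pairs) / len(pairs)


def precision_at_coverage(ranked_labels, total_positives, coverage=0.5):
    """Precision at the first rank where `coverage` of all true homologs is found.

    ranked_labels: booleans for hits sorted from best to worst score
                   (True means the hit is a true homolog).
    Returns None if the requested coverage is never reached.
    """
    needed = coverage * total_positives
    found = 0
    for rank, is_homolog in enumerate(ranked_labels, start=1):
        found += is_homolog
        if found >= needed:
            return found / rank
    return None
```

With this definition, "precision at half coverage" is the fraction of correct hits in the ranked list at the point where half of all homologs in the database have been retrieved. A weak direct score between the query and a hit can be raised by `neighbor_score` when their known relatives score well against each other.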
Discovery and Predictive Analysis of Protein Families
Krüppel-like factors (KLFs) are a diverse family of zinc finger transcription factors with important roles in many processes, including differentiation and development. Humans possess 17 KLF genes (KLF1–KLF17). We used sequence similarity searches and gene synteny analysis to identify a new putative KLF gene, which we named KLF18. It is present in most placental mammals. KLF18 is a chromosomal neighbor of the KLF17 gene and may be a product of its duplication. Predicted KLF18 proteins are conserved in zinc finger regions and have unique sequence repeats at the N terminus. No expression has been reported for KLF18, suggesting that it either has highly specialized functions or could have become a pseudogene. Besides KLF18, we identified several KLF18-like genes with expression in early embryonic development, suggesting that some of these new proteins are indeed functional.
Frizzled and Smoothened are homologous seven-transmembrane receptors functioning in the Wnt and Hedgehog signaling pathways, respectively. They harbor an extracellular cysteine-rich domain (FZ-CRD) found in a number of other metazoan proteins and Frizzled-like proteins in Dictyostelium. Through sensitive sequence and 3D structure similarity searches, we discovered the FZ-CRD domain in plants and Chromalveolata, revealing its much wider phylogenetic distribution. Furthermore, we found distant relatives of FZ-CRD in a number of proteins, e.g., in glypicans, which are important morphogen-binding proteoglycans. These findings reinforce the evolutionary ties between the Wnt and Hedgehog signaling pathways and underscore the importance of gene duplications followed by diversification in generating essential signaling components in metazoan evolution.
Our group is collaborating with more than 50 labs on various biological problems. For instance, in collaboration with Michael Brown and Joseph Goldstein (both at University of Texas Southwestern at Dallas), we identified the gene for ghrelin O-acyltransferase (GOAT). Ghrelin is a 28–amino acid, appetite-stimulating peptide hormone secreted by the food-deprived stomach. Serine-3 of ghrelin is acylated with an eight-carbon fatty acid, octanoate, which is required for its endocrine actions. However, the enzyme that catalyzes this essential reaction remained unknown and difficult to find. Today, few enzymatic activities linked to medically important biological processes are still "orphans." We hypothesized that GOAT might be an enzyme from the MBOAT family, which consists of membrane proteins with similar activities. Computational analysis found 16 members of the MBOAT family in the mouse genome: their genes were cloned, their proteins expressed, and their activities tested. Only 1 of the 16 proteins displayed GOAT activity, and the needed enzyme was discovered. Identification of GOAT will facilitate the search for its inhibitors that reduce appetite and diminish obesity in humans.
Grants from the National Institutes of Health, the Welch Foundation, and the Citrus Research and Development Foundation provide partial support for these projects.
As of May 5, 2015