Millions of species of living organisms on the planet possess billions of different proteins. This enormous diversity has evolved from a limited number of ancestral proteins, probably about a thousand. An expansion of more than 6 orders of magnitude in protein numbers has produced rich material for studying the laws of evolution.
From a pragmatic perspective, evolutionary links between proteins offer shortcuts to gain knowledge about homologs from a few experimentally characterized representatives. A homology link detected from sequences is the most powerful source of structure prediction, often leads to functional insights, and can guide experimental design.
From a theoretical perspective, we would like to understand how biological diversity is generated. Amazingly, the same themes and motifs are recurrently used, elaborately modified, and combined in evolution to produce functional entities. Our goal is to uncover these prevalent mechanisms in protein evolution.
Recent advances in obtaining sequence and structure information (~10,000,000 nonidentical sequences and ~60,000 different spatial structures) make for productive computational analysis. A grand step toward comprehending the protein universe would be the classification of sequence-structure data into an evolutionarily relevant hierarchical system. When two proteins display clear similarity, the task is straightforward. When similarity is low, however, discrimination between evolutionarily meaningful and spurious relationships becomes highly nontrivial. Our group finds and interprets distant links between proteins. Since no available method deals well with this difficult problem, we develop new mathematical approaches to explore protein sequence-structure data, and since no single narrow approach is able to find remote homologs, we combine sequence and structure analysis with evolutionary considerations to facilitate discoveries of biological significance.
Our goal is to improve computational methods to assess protein similarity. The focus of our research is on finding distant evolutionary connections and using them to study mechanisms of protein evolution and to make predictions about spatial structure, biological function, and relevance to disease. Since the majority of structurally similar protein pairs share low sequence identity (Figure 1, maximum at around 14 percent), our ability to detect them from sequence is important.
During the past several years, we improved two essential steps in sequence analysis: database similarity search and multiple alignment.
Protein database search. The BLAST family of programs revolutionized computational biology. Protein BLAST searches a database of sequences with a single query sequence. PSI-BLAST searches a database of sequences with a query alignment. Alignments reveal residue conservation patterns in a family and thus find more distant homologs. A few years back, our group developed COMPASS to search a database of alignments with an alignment. Taking into account conservation patterns in both query and hit, COMPASS can identify weak but homologous similarities. However, many remote evolutionary connections between proteins still remain undetectable. To push the boundary of sequence-based homology inference further, we recently improved statistical treatment of the results. First, we suggested a more realistic way to generate random reference alignments by shuffling secondary structural elements, instead of shuffling positions. Second, we proposed a new distribution to approximate the scores for comparisons of random alignments. Most importantly, in addition to position-specific amino acid frequencies, we incorporated other properties into scoring. This information (such as predicted secondary structure, positional conservation, and clustering of sequence motifs), combined with sequence profiles, yielded more accurate ranking of hits. The resulting tool (PROCAIN) is more powerful than its competitors.
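The difference between the two ways of generating random reference alignments can be sketched as follows. This is an illustration only, not the PROCAIN implementation: the function names, the representation of an alignment as a list of columns, and the per-column secondary-structure string (for example 'H', 'E', 'C') are all assumptions made for the example.

```python
import random


def split_into_elements(ss):
    """Split a per-column secondary-structure string (e.g. 'HHHCCEEE')
    into (start, end) index ranges, one per run of identical states."""
    segments = []
    start = 0
    for i in range(1, len(ss) + 1):
        # Close the current run at the end of the string or on a state change.
        if i == len(ss) or ss[i] != ss[start]:
            segments.append((start, i))
            start = i
    return segments


def shuffle_by_elements(columns, ss, rng):
    """Permute whole secondary-structure elements (hypothetical helper).
    Columns inside each element keep their original order."""
    if len(columns) != len(ss):
        raise ValueError("columns and ss must have the same length")
    segments = split_into_elements(ss)
    rng.shuffle(segments)
    return [columns[i] for start, end in segments for i in range(start, end)]


def shuffle_by_positions(columns, rng):
    """Permute individual columns, ignoring secondary structure."""
    shuffled = list(columns)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    rng = random.Random(0)
    columns = list("ABCDEFGH")   # stand-ins for alignment columns
    ss = "HHHCCEEE"              # assumed predicted secondary structure
    print("".join(shuffle_by_elements(columns, ss, rng)))
    print("".join(shuffle_by_positions(columns, rng)))
```

Shuffling by element preserves the local composition of helices, strands, and loops in the random reference, whereas shuffling by position destroys it.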
Multiple sequence alignment. After homologous sequences are found, they need to be aligned. Multiple alignment of protein sequences (MSA) is considered to be the second (after folding) most important problem in computational biology. Since alignments of proteins with unknown three-dimensional structure to proteins with determined structure are used for structure prediction and homology modeling, better MSA software results in better structure prediction. Traditionally, ClustalW is used for sequence alignments. ClustalW is relatively fast and for close sequences (above 30 percent identity) produces accurate alignments. However, the majority of protein families contain weakly similar sequences. For these remote homologs, ClustalW-like programs align only about 15 percent of amino acids correctly. We developed PROMALS3D, a multiple protein sequence and structure alignment program that is the most accurate aligner to date. For homologs with low sequence identity (~10 percent), PROMALS3D is about 3 times more accurate than ClustalW. The power of sequence alignment in PROMALS3D comes from three sources: (1) usage of the entire sequence database to build profiles, (2) inclusion of predicted secondary structures to score alignments, and (3) application of state-of-the-art hidden Markov models and consistency function to perform alignment. Moreover, PROMALS3D may be used as a joint sequence and structure aligner. If 3D structures are known, their coordinates are used for structural superposition to deduce alignments. PROMALS3D is freely available as a Web server and for download.
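One common way to express the fraction of amino acids aligned correctly is to count how many residue pairs from a reference alignment are reproduced by the alignment being tested. The sketch below assumes that definition and a simple representation of a pairwise alignment as two gapped strings; it is not taken from the PROMALS3D benchmarks, and the function names are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs that share a column
    in a pairwise alignment given as two equal-length gapped strings."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = set()
    i = j = 0  # residue counters for row_a and row_b
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def fraction_correct(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(*test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ("AC-DE", "ACGDE")   # invented toy alignment
    test = ("ACDE-", "ACGDE")        # same sequences, different gap placement
    print(fraction_correct(test, reference))
```

In practice the reference alignment would come from structural superposition of proteins with known 3D structures.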
Discrimination between distant homologs and structural analogs. A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures because of a limited number of energetically favorable ways to pack secondary structural elements. Discriminating between homology and analogy is nontrivial, especially when similarity between proteins is low. However, such discrimination is important, even for functional implications, as homologs tend to preserve their function. The most nonstandard feature of our work is the approach to select the datasets of homologs and structural analogs used for training the scoring function. Assembling such datasets is challenging. Homologs should not be biased by sequence similarity, since it is a well-established indicator of homology. Analogs should be as structurally close as possible, but this similarity should not be homologous. An example of what constitutes an analogous domain pair is given in Figure 2. Using these datasets, we optimized a scoring function to distinguish between evolutionary relatives and structurally convergent proteins. This function is a support vector machine (SVM) classifier that uses a dozen scores generated from protein sequences and 3D structures. This SVM correctly finds 90 percent of distant homologs that are manually classified in the SCOP (structural classification of proteins) database.
Application to Biological Problems
To find novel remote homology links between proteins, we have applied various computational methods enhanced by manual analysis to explore sequence and structure databases. Three examples of our projects are a large-scale study of a protein class, an in-depth analysis of a homologous group of proteins, and a collaborative project leading to an interesting discovery.
Classification of disulfide-rich domains. Small proteins are particularly difficult to analyze and classify because their short length results in marginally significant statistics during comparisons. Despite their small size, disulfide-rich proteins include some of the most important eukaryotic growth factors, toxins, enzyme inhibitors, hormones, pheromones, and allergens. Thus, they attract significant medical interest even apart from their fundamental importance. We classified approximately 3,000 small, disulfide-rich protein domains. These domains can be arranged into 41 fold groups on the basis of structural similarity. Our fold groups, which describe broader structural relationships than existing groupings of these domains, bring together representatives with previously unacknowledged similarities; 18 of the 41 fold groups include domains from several SCOP folds. Within the fold groups, the domains are assembled into families of homologs. We define 98 families of disulfide-rich domains, some of which include newly detected homologs, particularly among knottin-like domains. On the basis of this classification, we have examined cases of convergent and divergent evolution of functions performed by disulfide-rich proteins. This study establishes a basic framework for all future analyses of disulfide-rich domains.
Analysis of fic domains. Recently, in collaboration with Kim Orth (University of Texas Southwestern Medical Center at Dallas), we participated in the discovery of the function of the fic domain from Vibrio parahaemolyticus type III secreted effector VopS. A fic domain covalently modifies Rho GTPase threonine with AMP to inhibit downstream signaling events in host cells. The VopS fic domain includes a conserved sequence motif (HPFx[D/E]GN[G/K]R) that contributes to AMPylation. Extending this work, we used sequence and structure-based computational methods to identify fic homologs in doc toxins and the type III effector AvrB. The conserved sequence motif that contributes to AMPylation unites fic with doc (Figure 3). Although AvrB lacks this motif, its structure reveals a topology similar to that of the fic and doc folds. The binding of AvrB to a peptide fragment of its host virulence target is similar to the binding of fic to a peptide substrate. AvrB also orients a phosphate group from a bound ADP ligand near the peptide-binding site and in a position similar to that of a bound fic phosphate. The demonstrated eukaryotic fic domain AMPylation activity suggests that the VopS effector has exploited a novel host post-translational modification. Fic domain-related structures give insight into the AMPylation active site and into the VopS fic domain interaction with its host GTPase target. These results suggest that fic, doc, and AvrB stem from a common ancestor that has evolved to AMPylate protein substrates.
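A motif written in this notation can be scanned for with a regular expression, reading x as any residue and [D/E] as a choice between the listed residues. The sketch below is only an illustration of that reading: the helper names are hypothetical and the test sequence is invented, not a real VopS fragment. Exact motif matching of this kind is far less sensitive than the profile-based and structure-based methods used to find remote homologs.

```python
import re

FIC_MOTIF = "HPFx[D/E]GN[G/K]R"  # motif as written in the text


def motif_to_regex(motif):
    """Convert the motif notation to a regular expression:
    'x' becomes '.', and '/' separators inside brackets are dropped."""
    return re.compile(motif.replace("x", ".").replace("/", ""))


def find_motif(sequence, motif=FIC_MOTIF):
    """Return (0-based start, matched segment) for each non-overlapping
    occurrence of the motif in an uppercase one-letter protein sequence."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    toy_sequence = "MKTHPFREGNGRAAL"  # invented example sequence
    print(find_motif(toy_sequence))
```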
Experimental collaborations. Our group is collaborating with more than 30 labs on various biological problems. For instance, in collaboration with Michael Brown and Joseph Goldstein (both at UT Southwestern at Dallas), we identified the gene for ghrelin O-acyltransferase (GOAT). Ghrelin is a 28-amino-acid, appetite-stimulating peptide hormone secreted by the food-deprived stomach. Serine-3 of ghrelin is acylated with an eight-carbon fatty acid, octanoate, which is required for its endocrine actions. However, the enzyme that catalyzes this essential reaction remained unknown and difficult to find. Today, few enzymatic activities linked to medically important biological processes remain "orphan." We hypothesized that GOAT might be an enzyme from the MBOAT family, which consists of membrane proteins with similar activities. Computational analysis found 16 members of the MBOAT family in the mouse genome: their genes were cloned, proteins expressed, and activities tested. Only 1 of the 16 proteins displayed GOAT activity, and the needed enzyme was discovered. Identification of GOAT will facilitate the search for its inhibitors that reduce appetite and diminish obesity in humans.
Grants from the National Institutes of Health, the Welch Foundation, and the Citrus Research and Development Foundation provide partial support for these projects.
As of April 01, 2010