Computational Biology of Proteins


Summary: Nick Grishin is interested in using theoretical methods to understand proteins. He combines sequence and structure analysis with evolutionary considerations to facilitate discoveries of biological significance.

Millions of species of living organisms on the planet possess billions of different proteins. This enormous diversity has evolved from a limited number of ancestral proteins, probably about a thousand. An expansion of more than 6 orders of magnitude in protein numbers has produced rich material for studying the laws of evolution.

From a pragmatic perspective, evolutionary links between proteins offer shortcuts to gain knowledge about homologs from a few experimentally characterized representatives. A homology link detected from sequences is the most powerful source of structure prediction, often leads to functional insights, and can guide experimental design.

From a theoretical perspective, we would like to understand how biological diversity is generated. Amazingly, the same themes and motifs are recurrently used, elaborately modified, and combined in evolution to produce functional entities. Our goal is to uncover these prevalent mechanisms in protein evolution.

Recent advances in obtaining sequence and structure information (~2,000,000 nonidentical sequences and ~20,000 different spatial structures) make computational analysis productive. A grand step toward comprehending the protein universe would be the classification of sequence-structure data into an evolutionarily relevant hierarchical system. When two proteins display clear similarity, the task is straightforward. When similarity is low, however, discriminating between evolutionarily meaningful and spurious relationships becomes highly nontrivial. No available method deals well with this difficult problem, so we develop new mathematical approaches to explore protein sequence-structure data. And because no single narrow approach is able to find remote homologs, we combine sequence and structure analysis with evolutionary considerations to facilitate discoveries of biological significance.

Method Development
Our goal is to improve computational methods for assessing protein similarity. During the past year, we concentrated on the three-dimensional structural level. None of the numerous methods proposed for structural similarity search achieves the most fundamental task: finding structural motifs that satisfy the general definition of a protein fold, that is, globular units with the same secondary structure, topology, and architecture, irrespective of subtle differences in packing between elements. Such a program is key to mining protein structures. TOPS, the closest available utility, falls short because it checks topology only. While developing a structural motif search algorithm, we were surprised to find that a method providing broad, yet accurate, secondary structure delineation was lacking as well.

Delineation of secondary structural elements. To generate input for the fold search, we use Cα coordinates to delineate secondary structural elements. The elements we define (helices and strands) are linear, so they can be approximated by vectors. These elements cover about 85 percent of residues in the structure, so an element-based search represents most of each structure. Our program is predictive by nature; i.e., it does not evaluate the actual existence of hydrogen bonds, as the classic approach DSSP does, but rather checks whether the general geometry is close enough for hydrogen bonds to form under some conditions. As such, our approach is robust to coordinate errors up to 1.5 Å RMSD and is more suitable as the first step toward finding remote structural similarities.
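
The idea of approximating a linear element by a vector can be illustrated with a minimal sketch. It is not the delineation program described above: the function names, the `window` parameter, and the centroid-based axis are assumptions made for illustration, and the coordinates in the usage example are invented.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def element_axis(ca_coords, window=3):
    """Approximate a helix or strand by a vector (start point, end point).

    Assumption: the axis runs from the centroid of the first few C-alpha
    atoms to the centroid of the last few. `window` is a hypothetical knob.
    """
    if len(ca_coords) < 2:
        raise ValueError("need at least two C-alpha positions")
    w = max(1, min(window, len(ca_coords) // 2))
    return centroid(ca_coords[:w]), centroid(ca_coords[-w:])

def direction(axis):
    """Unit vector pointing from the start to the end of an element axis."""
    start, end = axis
    v = tuple(e - s for s, e in zip(start, end))
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        raise ValueError("degenerate axis of zero length")
    return tuple(c / norm for c in v)

def relative_orientation(axis_a, axis_b):
    """Label two element vectors by the sign of their dot product."""
    dot = sum(x * y for x, y in zip(direction(axis_a), direction(axis_b)))
    return "parallel" if dot > 0 else "antiparallel"

# Invented coordinates: two idealized strands running in opposite directions.
strand_1 = [(0.0, 0.0, 0.0), (3.3, 0.0, 0.0), (6.6, 0.0, 0.0), (9.9, 0.0, 0.0)]
strand_2 = [(9.9, 4.8, 0.0), (6.6, 4.8, 0.0), (3.3, 4.8, 0.0), (0.0, 4.8, 0.0)]
print(relative_orientation(element_axis(strand_1), element_axis(strand_2)))
```

Because each element is reduced to two points, small coordinate errors shift the vector only slightly, which is consistent with the robustness argument made above. A real delineation step would also have to decide where each element begins and ends, which this sketch takes as given.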

ProSMoS—protein structure motif search. Programs that assess structural similarity compare two structures to each other and define common regions. Structural classification experts look for a particular structural motif instead. Most programs base similarity scores on superposition and closeness of either coordinates or contacts. Experts pay more attention to the orientation of the main chain and general match of secondary structural elements. We developed a program that emulates an expert. Starting from a structure, the program delineates secondary structural elements. A matrix of element interactions (parallel or antiparallel) and handedness of connections is constructed. All structures are reduced to matrices that contain just enough information to define a fold, so the definition is general and large deviations in coordinates are tolerated. A user supplies a matrix for a motif, and ProSMoS lists all structures that exactly match this motif.
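
A matrix-based motif query of this kind can be sketched as follows. This is an illustrative toy, not the ProSMoS implementation or its file format: the encoding ("E" for strand, "H" for helix, "P"/"A" for parallel/antiparallel interactions, `None` for no interaction) and the function names are assumptions, and the handedness of connections is left out for brevity.

```python
from itertools import combinations

def build_matrix(n, interactions):
    """Symmetric n x n matrix from a {(i, j): kind} dict; None = no interaction."""
    m = [[None] * n for _ in range(n)]
    for (i, j), kind in interactions.items():
        m[i][j] = m[j][i] = kind
    return m

def find_motif(types, matrix, motif_types, motif_matrix):
    """Return index tuples of elements whose sub-matrix equals the motif matrix.

    Elements are taken in sequence order, so the motif's element order
    (its topology along the chain) is preserved in every hit.
    """
    k = len(motif_types)
    hits = []
    for idx in combinations(range(len(types)), k):
        if any(types[i] != t for i, t in zip(idx, motif_types)):
            continue
        if all(matrix[idx[a]][idx[b]] == motif_matrix[a][b]
               for a in range(k) for b in range(a + 1, k)):
            hits.append(idx)
    return hits

# Invented structure: strand, strand, helix, strand.
types = ["E", "E", "H", "E"]
matrix = build_matrix(4, {(0, 1): "A", (1, 3): "P"})

# Query: strand-helix-strand with the two strands parallel to each other.
motif_types = ["E", "H", "E"]
motif_matrix = build_matrix(3, {(0, 2): "P"})

print(find_motif(types, matrix, motif_types, motif_matrix))  # [(1, 2, 3)]
```

Since the comparison uses only element types and interaction labels, two structures with quite different coordinates produce the same hit as long as their matrices agree, which is the tolerance to coordinate deviations described above. Exhaustive enumeration of index tuples is fine for a toy example but would need pruning for large structures.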

SCOPmap: homology inference between protein structures. A difficult and time-consuming task is determining whether proteins with remotely similar structures share a common ancestor. To tackle this task computationally, we designed a strategy that uses various existing programs to check sequence and structure similarity statistics, combines their scores, and attributes a protein structure to a previously defined evolutionary classification. Our automatic method assigns about 95 percent of proteins to SCOP, leaving the rest to expert analysis. The algorithm is also able to identify potential evolutionary relationships not specified in the SCOP database. The strategy of the mapping algorithm is not limited to SCOP and can be applied to any other classification scheme based on evolution.
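
The general pattern of combining several methods and deferring uncertain cases to an expert can be sketched with a simple consensus rule. This is not the SCOPmap algorithm: the method names, the `min_agree` threshold, and the voting rule are all hypothetical and serve only to show the assign-or-defer structure.

```python
from collections import Counter

def assign_or_defer(method_hits, min_agree=2):
    """Return a classification label, or None to defer to expert analysis.

    method_hits maps a (hypothetical) method name to a pair
    (label, is_confident). A label is assigned only when at least
    `min_agree` confident methods agree and no other label ties with it.
    """
    votes = Counter(label for label, confident in method_hits.values() if confident)
    if not votes:
        return None
    best_label, best_count = votes.most_common(1)[0]
    tied = [label for label, count in votes.items() if count == best_count]
    if best_count >= min_agree and len(tied) == 1:
        return best_label
    return None

# Invented inputs: two methods agree confidently, one disagrees.
hits = {
    "sequence_search": ("superfamily_A", True),
    "profile_search": ("superfamily_A", True),
    "structure_alignment": ("superfamily_B", True),
}
print(assign_or_defer(hits))  # superfamily_A
```

Returning `None` instead of forcing a best guess mirrors the split described above, in which most proteins are assigned automatically and the remainder are left to expert analysis.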

Application to Biological Problems
As our main interest is to find novel remote homology links between proteins, we applied various computational methods enhanced by manual expert analysis to explore sequence and structure databases.

Classification of kinases. Understanding of both function and evolution is increased by bringing together all proteins that share significant functional similarities and looking at the whole group from an evolutionary perspective. Such work also results in many challenging structure predictions. We define kinases as enzymes that transfer the terminal phosphate group from ATP to another molecule. Although protein kinases are frequently analyzed and classified, no comprehensive classification of all kinases, including the many families of metabolic small-molecule kinases, existed. Overall, we classified the kinase sequences into 25 families of homologous proteins. We were able to perform structural annotations of all experimentally characterized kinase families, making this the first large functional class of proteins with a comprehensive structural annotation.

Protein structure prediction for the male-specific region of the human Y chromosome. Computational genomics necessitates annotation of proteins encoded in complete genomes. Meticulous expert-driven study of proteins from a single organism is capable of finding interesting functional annotations and structure predictions that are missed in large-scale automatic analyses. We applied our experience to the male-specific region of the human Y chromosome (MSY). We found that, in total, at least 60 domains are encoded by 27 distinct MSY genes; 42 (70 percent) of these domains were reliably mapped to known structures. The most challenging predictions include the unexpected but confident 3D structure assignments for three domains encoded by the USP9Y, UTY, and BPY2 genes. The domains with unknown 3D structures that are not predictable with available theoretical methods are primary targets for crystallographic or nuclear magnetic resonance studies.

EDD: a novel phosphotransferase domain. Using our program SCOPmap, which is designed to assign new protein structures automatically to existing evolutionary-based classification schemes, we identified an evolutionarily conserved domain (EDD) common to three different folds: mannose transporter EIIA domain (EIIA-man), dihydroxyacetone kinase (Dak), and DegV. Several lines of evidence support unifying these three folds into a single superfamily. First, PSI-BLAST detects statistically significant sequence similarity among them. Second, DALI Z-scores yield a "closed structural grouping" (each protein inside a group finds all other group members with scores higher than those of proteins outside the group) that includes only these proteins, which share a unique α-helical hairpin at the carboxyl terminus, and excludes all other proteins with similar topology. Finally, both Dak and EIIA-man perform similar phosphotransfer reactions, suggesting a phosphotransferase activity for the DegV-like family of proteins, whose function, other than the lipid binding revealed in the crystal structure, remains unknown.
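
The "closed structural grouping" criterion is a precise condition on pairwise scores, so it can be checked mechanically. The sketch below assumes the scores are already available as a nested dictionary `z[query][hit]`. The function name, the data layout, and the numbers in the example are illustrative assumptions, not real DALI output.

```python
def is_closed_group(z, group):
    """Check the closed-grouping condition on pairwise similarity scores.

    z[query][hit] is the score with which `query` finds `hit`.
    The group is closed if every member finds all other members, each with
    a score strictly higher than any score it gives to a non-member.
    """
    members = set(group)
    for query in members:
        inside = []
        for hit in members - {query}:
            score = z.get(query, {}).get(hit)
            if score is None:  # a member that is not found at all
                return False
            inside.append(score)
        outside = [s for hit, s in z.get(query, {}).items()
                   if hit not in members and hit != query]
        if inside and outside and min(inside) <= max(outside):
            return False
    return True

# Invented scores: A, B and C find each other better than they find X.
z = {
    "A": {"B": 12.0, "C": 9.5, "X": 4.0},
    "B": {"A": 12.0, "C": 10.1, "X": 5.2},
    "C": {"A": 9.5, "B": 10.1, "X": 3.3},
    "X": {"A": 4.0, "B": 5.2, "C": 3.3},
}
print(is_closed_group(z, ["A", "B", "C"]))  # True
print(is_closed_group(z, ["A", "B", "X"]))  # False
```

The condition is relative rather than absolute: it uses no fixed score cutoff and only requires that each member rank every other member above every outsider.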

Last updated July 01, 2009

HHMI INVESTIGATOR

Nick V. Grishin
 

Related Links

The Grishin Lab
(swmed.edu)

© 2010 Howard Hughes Medical Institute. A philanthropy serving society through biomedical research and science education.
4000 Jones Bridge Road, Chevy Chase, MD 20815-6789 | (301) 215-8500 | email: webmaster@hhmi.org