Comparing Genomes in Terms of Protein Structure: Surveys of a Finite Parts List

Abstract

We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in some respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g. analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better-defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into "fold families." This library can be built up automatically using a structure-comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and "top-10" statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e. those in different kingdoms, have a remarkably similar structure, being composed of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of "fold-counting" is that only a small subset of the structures in a complete genome is currently known, and this subset is prone to sample bias.
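The fold-counting step described in the abstract can be illustrated with a minimal sketch. It is not the report's actual pipeline: the genome names, gene identifiers, fold identifiers, and the helper functions `count_folds`, `venn_regions`, and `top_shared` are all hypothetical, and the toy data are invented purely for illustration. The sketch assumes each gene has already been assigned to a fold family by some upstream comparison against a fold library.

```python
from collections import Counter

def count_folds(assignments):
    """Count how many genes in one genome map to each fold.

    `assignments` is an iterable of (gene_id, fold_id) pairs (hypothetical format).
    """
    return Counter(fold for _gene, fold in assignments)

def venn_regions(fold_sets):
    """Group folds by the exact set of genomes in which they occur.

    Each key is a frozenset of genome names, i.e. one region of a Venn diagram.
    """
    regions = {}
    all_folds = set().union(*fold_sets.values())
    for fold in all_folds:
        members = frozenset(g for g, folds in fold_sets.items() if fold in folds)
        regions.setdefault(members, set()).add(fold)
    return regions

def top_shared(counts_by_genome, n=10):
    """Rank folds present in every genome by their total gene count."""
    shared = set.intersection(*(set(c) for c in counts_by_genome.values()))
    totals = Counter({f: sum(c[f] for c in counts_by_genome.values()) for f in shared})
    return totals.most_common(n)

# Invented toy assignments; not real genome data.
toy = {
    "genome_A": [("a1", "fold_1"), ("a2", "fold_1"), ("a3", "fold_2"), ("a4", "fold_3")],
    "genome_B": [("b1", "fold_1"), ("b2", "fold_2"), ("b3", "fold_2"), ("b4", "fold_4")],
    "genome_C": [("c1", "fold_1"), ("c2", "fold_3"), ("c3", "fold_5")],
}

counts = {name: count_folds(pairs) for name, pairs in toy.items()}
regions = venn_regions({name: set(c) for name, c in counts.items()})

if __name__ == "__main__":
    for members, folds in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(members), sorted(folds))
    print(top_shared(counts))
```

Counting genes per fold, as done here, emphasizes internal duplication. Counting each fold once per genome, or weighting by expression level, would be alternative methodologies reflecting the other aspects the abstract mentions; neither is shown in this sketch.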

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1998
Accession Number
ADA472206

Entities

People

  • Hedi Hegyi
  • Mark Gerstein

Organizations

  • Yale University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Amino Acids
  • Bacteria
  • Chemical Elements
  • Chemical Synthesis
  • Chemistry
  • Computational Science
  • Computer Programming
  • Computer Programs
  • Gene Expression
  • Genetics
  • Membrane Proteins
  • Microbial Genome
  • Microbiology
  • Microorganisms
  • Molecular Biology
  • Proteins
  • Sequence Analysis

Fields of Study

  • Biology

Readers

  • Business Analytics
  • Molecular Genetics
  • Regression Analysis