Genome-Wide Enzyme Annotation with Precision Control: Catalytic Families (CatFam) Databases

Abstract

In this article, we present a new method termed CatFam (Catalytic Families) to automatically infer the functions of catalytic proteins, which account for 20-40% of all proteins in living organisms and play a critical role in a variety of biological processes. CatFam is a sequence-based method that generates sequence profiles to represent and infer protein catalytic functions. CatFam generates profiles through a stepwise procedure that carefully controls profile quality and employs nonenzymes as negative samples to establish profile-specific thresholds associated with a predefined nominal false-positive rate (FPR) of predictions. The adjustable FPR allows for fine precision control of each profile and enables the generation of profile databases that meet different needs: function annotation with high precision and hypothesis generation with moderate precision but better recall. Multiple tests of CatFam databases (generated with distinct nominal FPRs) against enzyme and nonenzyme datasets show that the method's predictions have consistently high precision and recall. For example, a 1% FPR database predicts protein catalytic functions for a dataset of enzymes and nonenzymes with 98.6% precision and 95.0% recall. Comparisons of CatFam databases against other established profile-based methods for the functional annotation of 13 bacterial genomes indicate that CatFam consistently achieves higher precision and (in most cases) higher recall, and that (on average) CatFam provides 21.9% additional catalytic functions not inferred by the other similarly reliable methods. These results strongly suggest that the proposed method provides a valuable contribution to the automated prediction of protein catalytic functions. The CatFam databases and the database search program are freely available at http://www.bhsai.org/downloads/catfam.tar.gz.
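
The idea of a profile-specific threshold tied to a nominal FPR can be illustrated with a minimal Python sketch. This is not the CatFam implementation: the abstract does not say how thresholds are derived, so the empirical cutoff rule below, the function names threshold_for_fpr and precision_recall, and the notion of a single numeric profile score per sequence are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def threshold_for_fpr(negative_scores: Sequence[float], nominal_fpr: float) -> float:
    """Return a score cutoff for one profile (hypothetical rule, not CatFam's).

    A sequence counts as a hit only if it scores strictly above the cutoff.
    The cutoff is chosen so that the fraction of negative (nonenzyme) scores
    above it does not exceed nominal_fpr.
    """
    if not negative_scores:
        raise ValueError("need at least one negative score")
    if not 0.0 <= nominal_fpr < 1.0:
        raise ValueError("nominal_fpr must be in [0, 1)")
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives allowed to score above the cutoff; the small
    # epsilon guards against floating-point products like 28.999999.
    allowed = math.floor(nominal_fpr * len(ranked) + 1e-9)
    # The cutoff is the score of the (allowed + 1)-th highest negative.
    return ranked[allowed]


def precision_recall(
    positive_scores: Sequence[float],
    negative_scores: Sequence[float],
    cutoff: float,
) -> Tuple[float, float]:
    """Compute precision and recall when hits are scores strictly above cutoff."""
    tp = sum(1 for s in positive_scores if s > cutoff)
    fp = sum(1 for s in negative_scores if s > cutoff)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(positive_scores) if positive_scores else 0.0
    return precision, recall
```

Under these assumptions, calling threshold_for_fpr with a lower nominal_fpr raises the cutoff, which tends to trade recall for precision. That is the trade-off the abstract describes between high-precision annotation databases and moderate-precision databases for hypothesis generation.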

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2008
Accession Number
ADA593510

Entities

People

  • Chenggang Yu
  • Jaques Reifman
  • Nela Zavaljevski
  • Valmik Desai

Organizations

  • United States Army Medical Research and Development Command

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Amino Acids
  • Application Software
  • Computational Science
  • Computer Programs
  • Data Sets
  • Databases
  • Demographic Cohorts
  • High Performance Computing
  • Information Science
  • Machine Learning
  • Metabolic Pathways
  • Neutral Amino Acids
  • Precision
  • Sequences
  • Supervised Machine Learning
  • Three Dimensional
  • United States

Fields of Study

  • Biology

Readers

  • Database Systems and Applications
  • Molecular Genetics
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval