The Development of PIPA: An Integrated and Automated Pipeline for Genome-Wide Protein Function Annotation

Abstract

Background: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.

Results: PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases. PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings.
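The mapping step described in the abstract can be illustrated with a minimal sketch. This is not PIPA's implementation: the function names, the support and confidence thresholds, the placeholder identifiers COG_A and COG_B, and the sample annotations are all assumptions made for illustration. The sketch mines rules of the form "source term implies EC number" from annotated proteins, and optionally counts each EC number toward its coarser parent classes, which is one simple way to use the hierarchy.

```python
from collections import Counter


def ec_ancestors(ec):
    """Return an EC number and its coarser parent classes,
    e.g. '1.1.1.1' -> ['1.1.1.1', '1.1.1.-', '1.1.-.-', '1.-.-.-']."""
    parts = ec.split(".")
    levels = [ec]
    for i in range(3, 0, -1):
        levels.append(".".join(parts[:i] + ["-"] * (4 - i)))
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(levels))


def mine_mappings(samples, min_support=2, min_confidence=0.8, use_hierarchy=True):
    """Mine rules source_term -> EC term from annotated proteins.

    samples: iterable of (source_terms, ec_numbers), one pair per protein.
    Returns {(source_term, ec_term): confidence}.
    Thresholds are illustrative assumptions, not values from the report.
    """
    source_count = Counter()
    pair_count = Counter()
    for source_terms, ecs in samples:
        targets = set()
        for ec in ecs:
            targets.update(ec_ancestors(ec) if use_hierarchy else [ec])
        for s in set(source_terms):
            source_count[s] += 1
            for t in targets:
                pair_count[(s, t)] += 1

    rules = {}
    for (s, t), n in pair_count.items():
        confidence = n / source_count[s]
        if n >= min_support and confidence >= min_confidence:
            rules[(s, t)] = confidence
    return rules


# Hypothetical sample proteins: (source ontology terms, EC numbers)
samples = [
    (["COG_A"], ["1.1.1.1"]),
    (["COG_A"], ["1.1.1.2"]),
    (["COG_A"], ["1.1.1.1"]),
    (["COG_B"], ["2.7.1.1"]),
    (["COG_B"], ["2.7.1.1"]),
]

flat = mine_mappings(samples, use_hierarchy=False)
hier = mine_mappings(samples, use_hierarchy=True)
print(len(flat), len(hier))
```

In this toy data, COG_A has no single four-digit EC number that clears the confidence threshold, so the flat run produces no rule for it, while the hierarchical run can still map it to the class 1.1.1.-. A practical pipeline would likely keep only the most specific rule per source term rather than every ancestor level.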

Document Details

Document Type: Technical Report
Publication Date: Jan 25, 2008
Accession Number: ADA480821

Entities

People

  • Chenggang Yu
  • Fred J. Stevens
  • Jaques Reifman
  • Nela Zavaljevski
  • Seth Johnson
  • Valmik Desai

Tags

DTIC Thesaurus Topics

  • Accuracy
  • Application Software
  • Biological Processes
  • Computational Science
  • Computations
  • Computer Programs
  • Consensus Algorithms
  • Databases
  • Demographic Cohorts
  • High Performance Computing
  • Integrated Systems
  • Machine Learning
  • Ontologies
  • Pipelines
  • Precision
  • Sequence Analysis
  • Topology

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computational Modeling and Simulation
  • Molecular Genetics

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks