The Development of PIPA: An Integrated and Automated Pipeline for Genome-Wide Protein Function Annotation

Abstract

Background: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities. Results: PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases. PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Jan 25, 2008
Accession Number: ADA480821

Entities

People

Chenggang Yu
Fred J. Stevens
Jaques Reifman
Nela Zavaljevski
Seth Johnson
Valmik Desai

The Development of PIPA: An Integrated and Automated Pipeline for Genome-Wide Protein Function Annotation

Abstract

Document Details

Entities

People

Tags

DTIC Thesaurus Topics

Fields of Study

Readers

Technology Areas