PIPA: A High-Throughput Pipeline for Protein Function Annotation

Abstract

Traditional experimental methods for determining the functions of proteins encoded in genomic sequences cannot keep pace with the avalanche of sequence data produced by new high-throughput sequencing technologies. This has prompted the development of numerous bioinformatics approaches for automated protein function annotation. However, these approaches frequently use different function classification terminologies, which precludes the integration of predictions from multiple sources. We developed the Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high-performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that give the user access to, and control of, the pipeline's parallel execution of multiple jobs, each of which searches a particular database for one chunk of the input data. The execution module wraps the second module, the core pipeline module, whose main components are the integrated resources, the program for terminology conversion to GO, and the consensus annotation program. The third module, the preprocessing module, contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates the GO mappings used by the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an in-house-developed database for enzyme catalytic function prediction (CatFam) with the results of multiple integrated resources.
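The data flow described in the abstract (split the input into chunks, run one job per chunk and database, convert each tool's terms to GO, then merge the results) can be illustrated with a minimal Python sketch. This is not PIPA's code. All function names, the source labels other than CatFam, and the term identifiers are hypothetical placeholders. The abstract does not say how the consensus annotation program resolves conflicts, so the rule used here (keep a GO term supported by at least `min_sources` sources) is an assumption chosen only for illustration.

```python
from collections import Counter
from itertools import product


def split_into_chunks(sequences, chunk_size):
    """Split the input list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [sequences[i:i + chunk_size]
            for i in range(0, len(sequences), chunk_size)]


def plan_jobs(chunks, databases):
    """Plan one job per (chunk index, database) pair.

    Each pair stands for one independent search that a scheduler
    could run in parallel with the others.
    """
    return list(product(range(len(chunks)), databases))


def to_go_terms(source, terms, go_mapping):
    """Convert one source's native terms to GO terms.

    go_mapping is keyed by (source, native term); terms without a
    mapping are dropped.
    """
    go_terms = []
    for term in terms:
        go_terms.extend(go_mapping.get((source, term), []))
    return go_terms


def consensus(go_terms_by_source, min_sources=2):
    """Keep GO terms supported by at least min_sources distinct sources.

    ASSUMPTION: simple support counting, not the rule PIPA actually uses.
    """
    votes = Counter()
    for terms in go_terms_by_source.values():
        votes.update(set(terms))  # each source counts once per term
    return sorted(term for term, n in votes.items() if n >= min_sources)


# Toy usage with placeholder identifiers (not real GO or database terms).
chunks = split_into_chunks(["p1", "p2", "p3", "p4", "p5"], chunk_size=2)
jobs = plan_jobs(chunks, ["catfam", "toolB"])

go_mapping = {
    ("catfam", "T1"): ["GO:X1"],
    ("toolB", "famA"): ["GO:X1", "GO:X2"],
    ("toolC", "k9"): ["GO:X2"],
}
raw_predictions = {
    "catfam": ["T1"],
    "toolB": ["famA"],
    "toolC": ["k9", "unmapped"],
}
by_source = {source: to_go_terms(source, terms, go_mapping)
             for source, terms in raw_predictions.items()}
annotation = consensus(by_source, min_sources=2)
```

Converting every prediction to GO before merging is what makes the support count meaningful: terms from different tools become comparable only once they share one vocabulary.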

Document Details

Document Type
Technical Report
Publication Date
Jul 01, 2008
Accession Number
ADA526660

Entities

People

  • Chenggang Yu
  • Jaques Reifman
  • Nela Zavaljevski
  • Valmik Desai

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Accuracy
  • Application Software
  • Biomedical Research
  • Computational Science
  • Computer Programming
  • Computer Programs
  • Consensus Algorithms
  • Conversion
  • Demographic Cohorts
  • Graphical User Interface
  • High Performance Computing
  • Information Systems
  • Military Research
  • Pathogenic Bacteria
  • Sequence Analysis
  • Sequences
  • User Interface

Fields of Study

  • Biology
  • Engineering

Readers

  • Computer Science
  • Library and Information Science
  • Molecular Genetics