A Web-based High-Throughput Tool for Next-Generation Sequence Annotation

Abstract

The availability of a large number of genome sequences, resulting from inexpensive, high-throughput next-generation sequencing platforms, has created the need for an integrated, fully automated, rapid, and high-throughput annotation capability that is also easy to use. Here, we present a web-based software application, Annotation of Genome Sequences (AGeS), which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. The current version of AGeS provides annotations for bacterial genome sequences and serves as a readily accessible resource to Department of Defense (DoD) scientists for storing, annotating, and visualizing genomes of newly sequenced pathogens of interest. The AGeS system is composed of two major components. The first component is a web-based application that provides a graphical user interface for managing users' input genomes, submitting annotation jobs, and visualizing results. Sequence contigs are uploaded as a multi-FASTA input file and submitted for annotation, and the resulting annotations are visualized through GBrowse. The input genome sequences and the annotation results are stored in a secure, customized database. The second component is a high-throughput annotation pipeline that finds the genomic regions coding for proteins, RNAs, and other genomic elements through a Do-It-Yourself Annotation framework. The pipeline also functionally annotates the protein-coding regions using an in-house-developed high-throughput pipeline, the Pipeline for Protein Annotation. The annotation pipeline has been deployed on the Mana Linux cluster at the Maui High Performance Computing Center. The two components are connected through the DoD user interface toolkit application programming interface. The AGeS system was evaluated for the scaling of its parallel execution and for its annotation performance. AGeS scaled with super-linear speedup for up to 128 processors.
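The abstract states that sequence contigs are uploaded as a multi-FASTA input file. As a rough illustration of what that input looks like, the sketch below splits multi-FASTA text into individual contigs. It is not taken from AGeS: the function name `parse_multi_fasta`, the contig identifiers, and the sequences are hypothetical, and it assumes only the standard FASTA convention that each record starts with a `>` header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable, List, Optional


def parse_multi_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Split multi-FASTA text into a {contig_id: sequence} mapping.

    Hypothetical helper for illustration; not part of AGeS.
    """
    parts: Dict[str, List[str]] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # Header line: the ID is the first whitespace-delimited token.
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]
            if current_id in parts:
                raise ValueError(f"duplicate contig identifier: {current_id}")
            parts[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequence lines of one record may wrap across several lines.
            parts[current_id].append(line)
    return {contig_id: "".join(chunks) for contig_id, chunks in parts.items()}


# Made-up two-contig input, used only to show the file layout.
example = """>contig_1 hypothetical example
ACGTAC
GTTA
>contig_2
GGCC
"""

contigs = parse_multi_fasta(example.splitlines())
for contig_id, sequence in contigs.items():
    print(contig_id, len(sequence))
# prints:
# contig_1 10
# contig_2 4
```

Keeping one entry per contig is a design choice made here for clarity; a production pipeline would more likely rely on an established parser and stream records rather than hold every sequence in memory.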

Document Details

Document Type
Technical Report
Publication Date
Jun 01, 2011
Accession Number
ADA572794

Entities

People

  • Chenggang Yu
  • Deepak Grover
  • Jaques Reifman
  • Kamal Kumar
  • Li Cheng
  • Maxim Khitrov
  • Nela Zavaljevski
  • Ravi V. Satya
  • Valmik Desai

Organizations

  • United States Army Medical Research and Development Command

Tags

DTIC Thesaurus Topics

  • Application Software
  • Computer Programming
  • Computer Science
  • Computers
  • Database Management Systems
  • Databases
  • Demographic Cohorts
  • Department Of Defense
  • Graphical User Interface
  • High Performance Computing
  • Information Systems
  • Parallel Computing
  • Relational Database Management Systems
  • Sequence Analysis
  • User Interface
  • Web Applications
  • Web Browsers

Fields of Study

  • Biology
  • Engineering

Readers

  • Database Systems and Applications
  • Molecular Genetics
  • Parallel and Distributed Computing