Literature Mining of Pathogenesis-Related Proteins in Human Pathogens for Database Annotation

Abstract

Biomedical literature represents the primary source of experimental data and biological knowledge. This project developed a text mining system for pathogens of biodefense relevance, focusing on mining pathogen-host proteomic data. We developed a Support Vector Machine (SVM)-based system to identify abstracts containing protein-protein interaction (PPI) information, using an annotated corpus of 1360 MEDLINE abstracts as the training set. It achieved good performance on document classification, with a precision of over 80% among the top 50 ranked abstracts. The SVM-based method was further augmented with other text mining tools (such as PIE) for mining and tagging PPI information. As part of an effort to enable text mining tools for real-world applications, we coupled our analysis with the functional annotation of proteomic experiments. All the data were then loaded into the iProXpress system and provided to the collaborating USAMRIID laboratory for the analysis of bacterial pathogen proteomics data.
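The ranking metric quoted in the abstract (precision among the top 50 ranked abstracts) can be illustrated with a minimal sketch. It assumes the usual definition of precision at k: the fraction of the k highest-scoring documents that are truly relevant. The report does not state how the figure was computed, so the function name `precision_at_k`, the scores, and the labels below are all hypothetical and are not taken from the report.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of the k highest-scoring documents that are relevant (label 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("no documents to rank")
    # Rank document indices by classifier score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]  # may hold fewer than k items if the corpus is small
    return sum(labels[i] for i in top) / len(top)


# Hypothetical classifier scores and gold labels
# (1 = abstract contains PPI information, 0 = it does not).
example_scores = [2.1, -0.3, 1.4, 0.2, -1.7, 0.9]
example_labels = [1, 0, 1, 0, 0, 1]

top3 = precision_at_k(example_scores, example_labels, 3)
top4 = precision_at_k(example_scores, example_labels, 4)
```

In this toy data the three highest-scoring abstracts are all relevant, while the fourth is not, so the value drops once k reaches 4. Under this definition, a precision of over 80% at k = 50 would mean that more than 40 of the 50 top-ranked abstracts were relevant.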

Document Details

Document Type
Technical Report
Publication Date
Oct 01, 2009
Accession Number
AD1041385

Entities

People

  • Cathy H. Wu

Organizations

  • Georgetown University Medical Center

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Chemistry
  • Computational Biology
  • Computational Science
  • Computer Science
  • Data Analysis
  • Data Mining
  • Information Science
  • Information Systems
  • Machine Learning
  • Network Science
  • Ontologies
  • Proteins
  • Proteomics
  • Supervised Machine Learning
  • Systems Biology
  • Text Mining

Readers

  • Computational Linguistics
  • Critical Infrastructure Protection in CBRN and WMD Threats
  • Data Mining and Knowledge Discovery

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • Biotechnology