Learning to Extract Gene-Protein Names from Weakly-Labeled Text

Abstract

Training a named entity recognition (NER) system is difficult because of the effort required to produce a large amount of annotated training data. In this paper, we reduce or eliminate that effort by automatically converting other sources of data into annotated training data. We test the approach on a gene-protein name extractor using the mouse and fly data from the BioCreAtIvE challenge. Results show that our methods are effective and that our trained NER system outperforms all of our baselines.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2008
Accession Number
ADA531043

Entities

People

  • Anthony Tomasic
  • Isaac Simmons
  • Richard C. Wang
  • Robert E. Frederking
  • William W. Cohen

Organizations

  • Carnegie Mellon University

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Base Lines
  • Computer Science
  • Data Sets
  • Dictionaries
  • Filters
  • Filtration
  • Language
  • Learning
  • Machine Learning
  • Named Entity Recognition
  • Natural Language Processing
  • Precision
  • Protein-Protein Interactions
  • Test And Evaluation
  • Text Mining
  • Training

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Computer Vision