Rapid Training of Information Extraction with Local and Global Data Views

Abstract

This dissertation focuses on fast system development for Information Extraction (IE). State-of-the-art systems rely heavily on extensively annotated corpora, which are slow to build for a new domain or task. Moreover, previous systems are mostly built with local evidence, such as words in a short context window or features extracted at the sentence level, and they usually generalize poorly to new domains. This dissertation presents novel approaches for rapidly training an IE system for a new domain or task based on both local and global evidence. Specifically, we present three systems: a relation type extension system based on active learning, a relation type extension system based on semi-supervised learning, and a cross-domain bootstrapping system for domain-adaptive named entity extraction. The active learning procedure adopts features extracted at the sentence level as the local view and distributional similarities between relational phrases as the global view. It builds two classifiers based on these two views to find the most informative contention data points for which to request human labels, so as to reduce annotation cost. The semi-supervised system aims to learn a large set of accurate patterns for extracting relations between names from only a few seed patterns. It estimates the confidence of a name pair both locally and globally: locally by looking at the patterns that connect the pair in isolation; globally by incorporating the evidence from the clusters of patterns that connect the pair. The use of pattern clusters can prevent semantic drift and contribute to a natural stopping criterion for semi-supervised relation pattern discovery. For adapting a named entity recognition system to a new domain, we propose a cross-domain bootstrapping algorithm, which iteratively learns a model for the new domain with labeled data from the original domain and unlabeled data from the new domain.
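
The two-view active learning step described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `select_contention_points`, the `(label, confidence)` return convention, the toy lookup tables, and the rule of ranking disagreements by the smaller of the two confidences are all assumptions made for illustration. The abstract only states that two classifiers, one per view, are used to find the most informative contention points.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Prediction = Tuple[str, float]  # (predicted label, confidence in [0, 1])


def select_contention_points(
    pool: Sequence[str],
    local_view: Callable[[str], Prediction],
    global_view: Callable[[str], Prediction],
    batch_size: int,
) -> List[int]:
    """Return indices of unlabeled examples to send to a human annotator.

    A contention point is an example on which the two view classifiers
    predict different labels. Ranking rule (an assumption, not taken from
    the source): prefer points where both classifiers are confident,
    measured by the smaller of the two confidences.
    """
    scored = []
    for index, example in enumerate(pool):
        local_label, local_conf = local_view(example)
        global_label, global_conf = global_view(example)
        if local_label != global_label:
            scored.append((min(local_conf, global_conf), index))
    # Highest score first; ties broken by original position in the pool.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:batch_size]]


# Hypothetical stand-ins for the two trained classifiers: fixed lookup tables.
LOCAL_TABLE: Dict[str, Prediction] = {
    "a": ("EMPLOYMENT", 0.9),
    "b": ("EMPLOYMENT", 0.6),
    "c": ("OTHER", 0.8),
    "d": ("OTHER", 0.7),
}
GLOBAL_TABLE: Dict[str, Prediction] = {
    "a": ("EMPLOYMENT", 0.8),
    "b": ("OTHER", 0.7),
    "c": ("EMPLOYMENT", 0.9),
    "d": ("OTHER", 0.95),
}

pool = ["a", "b", "c", "d"]
top_one = select_contention_points(pool, LOCAL_TABLE.__getitem__, GLOBAL_TABLE.__getitem__, 1)
all_points = select_contention_points(pool, LOCAL_TABLE.__getitem__, GLOBAL_TABLE.__getitem__, 5)
```

In this toy pool the two views agree on "a" and "d", so only "b" and "c" are candidates for annotation. In a real system the lookup tables would be replaced by a sentence-level feature classifier (local view) and a classifier over distributional similarities between relational phrases (global view), both retrained after each batch of new labels.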

Document Details

Document Type
Technical Report
Publication Date
May 01, 2012
Accession Number
ADA587149

Entities

People

  • Ang Sun

Organizations

  • New York University

Tags

Communities of Interest

  • Autonomy
  • Biomedical

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Computer Science
  • Cross Domain
  • Feature Extraction
  • Hidden Markov Models
  • Kernel Functions
  • Machine Learning
  • Markov Models
  • Named Entity Recognition
  • Natural Language Processing
  • Probability
  • Probability Distributions
  • Recognition
  • Semi-Supervised Learning
  • Supervised Machine Learning
  • Theses
  • Training

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks