Collective Segmentation and Labeling of Distant Entities in Information Extraction

Abstract

In information extraction, we often wish to identify all mentions of an entity such as a person or organization. Traditionally, a group of words is labeled as an entity based only on local information. But information from throughout a document can be useful: for example, if the same word is used multiple times, it is likely to have the same label each time. We present a conditional random field (CRF) that explicitly represents dependencies between the labels of pairs of similar words in a document. On a standard information extraction data set, we show that learning these dependencies leads to a 13.7% reduction in error on the field that had caused the most repetition errors.
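
The idea of linking the labels of similar words can be illustrated with a small sketch. The abstract does not say how "similar" is defined, so the code below assumes, purely for illustration, that two words are similar when they are identical and capitalized. The function name `skip_edges` and the sample sentence are hypothetical. The sketch only finds the word pairs whose labels such a model would connect; it does not train or run a CRF.

```python
from collections import defaultdict
from itertools import combinations


def skip_edges(tokens):
    """Return index pairs (i, j), i < j, linking identical capitalized tokens."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        # Assumption: "similar" means the same string, starting with a capital.
        if tok[:1].isupper():
            positions[tok].append(i)
    edges = []
    for idxs in positions.values():
        # Every pair of occurrences of the same word gets one edge.
        edges.extend(combinations(idxs, 2))
    return sorted(edges)


tokens = "Speaker : John Smith . Professor Smith will talk at 3 pm".split()
print(skip_edges(tokens))
```

Here the two occurrences of "Smith" are paired, so evidence that the first is part of a person name could inform the label of the second.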

Document Details

Document Type: Technical Report
Publication Date: Jul 01, 2004
Accession Number: ADA439444

Entities

People

  • Andrew McCallum
  • Charles Sutton

Organizations

  • University of Massachusetts Amherst

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Bayesian Networks
  • Computational Linguistics
  • Computational Science
  • Computer Science
  • Computer Vision
  • Data Sets
  • Extraction
  • Language
  • Linguistics
  • Machine Learning
  • Natural Language Processing
  • Probabilistic Models
  • Probability
  • Probability Distributions

Fields of Study

  • Computer science

Readers

  • Speech Processing/Speech Recognition
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks