Sparse Information Extraction: Unsupervised Language Models to the Rescue

Abstract

Even in a massive corpus such as the Web, a substantial fraction of extractions appear infrequently. This paper shows how to assess the correctness of sparse extractions by utilizing unsupervised language models. The REALM system, which combines HMM-based and n-gram-based language models, ranks candidate extractions by the likelihood that they are correct. Our experiments show that REALM reduces extraction error by 39%, on average, when compared with previous work. Because REALM pre-computes language models based on its corpus and does not require any hand-tagged seeds, it is far more scalable than approaches that learn models for each individual relation from hand-tagged data. Thus, REALM is ideally suited for open information extraction, where the relations of interest are not specified in advance and their number is potentially vast.

Document Details

Document Type
Technical Report
Publication Date
Jun 01, 2007
Accession Number
ADA534427

Entities

People

  • Doug Downey
  • Oren Etzioni
  • Stefan Schoenmackers

Organizations

  • University of Washington

Tags

Communities of Interest

  • C4I
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Bayesian Networks
  • Computational Science
  • Computer Languages
  • Computer Science
  • Hidden Markov Models
  • Information Retrieval
  • Information Science
  • Language
  • Machine Learning
  • Markov Models
  • Named Entity Recognition
  • Natural Language Processing
  • Probabilistic Models
  • Probability
  • Probability Distributions

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval