Analysis of a Probabilistic Model of Redundancy in Unsupervised Information Extraction

Abstract

Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems present an approach to a primary challenge in artificial intelligence: the automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? We present a combinatorial "balls-and-urns" model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns's performance, including a theorem showing that PAC learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.

Document Details

Document Type
Technical Report
Publication Date
Aug 25, 2010
Accession Number
ADA632614

Entities

People

  • Doug Downey
  • Oren Etzioni
  • Stephen Soderland

Organizations

  • University of Washington

Tags

Communities of Interest

  • Energy and Power Technologies
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computer Languages
  • Estimators
  • Information Science
  • Machine Learning
  • Models
  • Network Science
  • Probabilistic Models
  • Probability
  • Random Variables
  • Redundancy
  • Repetition Rate
  • Supervised Machine Learning
  • Training
  • Unsupervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Statistical inference
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Information Retrieval