Open Information Extraction

Abstract

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small, homogeneous corpora (e.g., extracting the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples, so the manual labor scales linearly with the number of target relations. This proposal introduces Open IE, a new extraction paradigm in which the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The proposal also introduces TextRunner, a fully implemented, highly scalable Open IE system in which the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. Open IE is a very recent research breakthrough funded, in part, by our previous ONR grant on "Semantic Tractability on the World Wide Web". Here, we propose to study its efficacy and extend it in several important ways.
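
As a rough illustration of the pipeline the abstract describes (one pass over a corpus, relational tuples extracted without pre-named relations, a probability per tuple, and an index for querying), the following toy sketch may help. It is not TextRunner's method: the regular expression, the function names (extract_tuples, score, build_index), the redundancy-based scoring formula, and the sample sentences are all assumptions made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern (an assumption, not TextRunner's extractor):
# capitalized argument, lowercase relation phrase, capitalized argument.
TUPLE_RE = re.compile(
    r"\b([A-Z]\w*(?: [A-Z]\w*)*) ([a-z]+(?: [a-z]+)*) ([A-Z]\w*(?: [A-Z]\w*)*)\b"
)


def extract_tuples(corpus):
    """Single pass over the corpus; counts each (arg1, relation, arg2) tuple."""
    counts = Counter()
    for sentence in corpus:
        for arg1, rel, arg2 in TUPLE_RE.findall(sentence):
            counts[(arg1, rel, arg2)] += 1
    return counts


def score(count, p=0.5):
    """Toy confidence: chance that at least one of `count` sightings is right,
    assuming (arbitrarily) each sighting is independently correct with prob. p."""
    return 1.0 - (1.0 - p) ** count


def build_index(counts):
    """Index tuples by relation phrase so they can be looked up by query."""
    index = defaultdict(list)
    for (arg1, rel, arg2), n in counts.items():
        index[rel].append((arg1, arg2, score(n)))
    return index


# Made-up sample corpus; no relation names are supplied in advance.
corpus = [
    "Einstein was born in Ulm.",
    "Seattle is located in Washington.",
    "Einstein was born in Ulm.",
]
counts = extract_tuples(corpus)
index = build_index(counts)
print(index["was born in"])
```

The point of the sketch is only that the relation phrases ("was born in", "is located in") come out of the text itself rather than from a user-supplied list, and that a tuple seen more than once receives a higher score than one seen only once.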

Document Details

Document Type
Technical Report
Publication Date
Dec 31, 2010
Accession Number
ADA538482

Entities

People

  • Oren Etzioni

Organizations

  • University of Washington

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computational Linguistics
  • Computational Science
  • Computer Science
  • Data Mining
  • Hidden Markov Models
  • Information Processing
  • Information Science
  • Language
  • Machine Learning
  • Markov Models
  • Named Entity Recognition
  • Natural Language Processing
  • Natural Languages
  • Ontologies
  • Probability

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval