Unsupervised Discovery of Extraction Patterns for Information Extraction

Abstract

The task of Information Extraction (IE) is to find specific types of information in natural language text. In particular, event extraction identifies instances of a particular type of event or fact (a particular "scenario"), including the entities involved, and fills a database that has been pre-defined for the scenario. As the number of documents available on-line has multiplied, event extraction has grown in importance for various applications, including tracking terrorist activities from newswire sources and building a database of job postings from the Web, to name a few. Linguistic contexts, such as predicate-argument relationships, have been widely used as extraction patterns to identify the items to be extracted from the text. The cost of creating extraction patterns for each scenario has been a bottleneck limiting the portability of information extraction systems to different scenarios, although there has been some research on semi-supervised pattern discovery procedures to reduce this cost. The challenge is to develop a fully automatic method for identifying extraction patterns for a scenario specified by the user. This dissertation presents a novel approach for the unsupervised discovery of extraction patterns for event extraction from raw text. First, we present a framework that gives the user a self-customizing information extraction system for his/her query: the Query-Driven Information Extraction (QDIE) framework. The input to the QDIE framework is the user's query: either a set of keywords or a narrative description of the event extraction task. Second, we assess the improvement in extraction pattern models. By considering the shortcomings of the prior work based on predicate-argument models and their extensions, we propose a novel extraction pattern model that is based on arbitrary subtrees of dependency trees. Third, we address the issue of portability across languages.

Document Details

Document Type: Technical Report
Publication Date: Sep 01, 2004
Accession Number: ADA598383

Entities

People

  • Kiyoshi Sudo

Organizations

  • New York University

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Computer Science
  • Databases
  • Extraction
  • Formal Languages
  • Information Operations
  • Instructions
  • Language
  • Named Entity Recognition
  • Natural Languages
  • Naval Warfare
  • New York
  • Precision
  • Theses
  • Translations
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Database Systems and Applications
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval