Facilitating Treebank Annotation Using a Statistical Parser

Abstract

Corpora of phrase-structure-annotated text, or treebanks, are useful for supervised training of statistical models for natural language processing, as well as for corpus linguistics. Their primary drawback, however, is that they are very time-consuming to produce. To alleviate this problem, the standard approach is to make two passes over the text: first, parse the text automatically; then, correct the parser output by hand. In this paper we explore three questions: How much does an automatic first pass speed up annotation? Does this automatic first pass affect the reliability of the final product? What kind of parser is best suited for such an automatic first pass? We investigate these questions through an experiment to augment the Penn Chinese Treebank [15] using a statistical parser developed by Chiang [3] for English. This experiment differs from previous efforts in two ways: first, we quantify the increase in annotation speed provided by the automatic first pass (70-100%); second, we use a parser developed on one language to augment a corpus in an unrelated language.

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2001
Accession Number: ADA460488

Entities

People

  • David Chiang
  • Fu-dong Chiou
  • Martha Palmer

Organizations

  • University of Pennsylvania

Tags

DTIC Thesaurus Topics

  • Accuracy
  • Automatic
  • Commerce
  • Grammars
  • Information Operations
  • Information Science
  • Intellectual Property
  • Language
  • Linguistics
  • Natural Language Processing
  • Natural Languages
  • Precision
  • Property Rights
  • Reliability
  • Standards
  • Test And Evaluation

Readers

  • Computational Linguistics
  • Computational Modeling and Simulation

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks