Deducing Linguistic Structure from the Statistics of Large Corpora

Abstract

Within the last two years, approaches using both stochastic and symbolic techniques have proved adequate to deduce lexical ambiguity resolution rules with an error rate below 3-4% when trained on moderate-sized (500K-word) corpora of English text (e.g., Church, 1988; Hindle, 1989). The success of these techniques suggests that much of the grammatical structure of language may be derived automatically through distributional analysis, an approach attempted and abandoned in the 1950s. We describe here two experiments to see how far purely distributional techniques can be pushed to automatically provide both a set of part-of-speech tags for English and a grammatical analysis of free English text. We also discuss the state of a tagged natural language corpus built to aid such research (now amounting to 4 million words of hand-corrected part-of-speech tagging).

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1990
Accession Number
ADA458686

Entities

People

  • Beatrice Santorini
  • David Magerman
  • Mitchell Marcus

Organizations

  • University of Pennsylvania

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Automated Speech Recognition
  • Boundaries
  • Computational Science
  • Data Science
  • Errors
  • Frequency
  • Grammars
  • Information Science
  • Language
  • Linguistics
  • Natural Language Processing
  • Natural Languages
  • Probabilistic Models
  • Probability
  • Statistics

Readers

  • Computational Linguistics
  • Systems Analysis and Design