Some Statistical Opportunities in Speech and Language

Abstract

Text analysis is a hot topic, and for good reason. Text is more available than ever before. Just ten years ago, the one-million-word Brown Corpus (Francis and Kucera, 1982) was still considered large, but even then, there were much larger corpora in use, such as the 18-million-word Birmingham Corpus (Sinclair 1987a, 1987b). These days, there are many places that regularly use samples of text running into the hundreds of millions of words, and it is very likely that billions of words will be available very soon. All of this data provides a great research opportunity; it is easier these days to use corpus data effectively than it was in the 1950s, the last time that empiricism was in fashion. Text analysis focuses on broad (though possibly superficial) coverage of unrestricted text, rather than a deep analysis of a restricted domain. This pragmatic view toward coverage and performance distinguishes text analysis from so-called intelligent approaches such as natural language understanding. This approach has produced a number of tools, such as spelling correctors and part-of-speech taggers, that work on unrestricted text with reasonable accuracy and efficiency.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1992
Accession Number
ADP007097

Entities

People

  • Kenneth W. Church

Organizations

  • University of Southern California

Tags

DTIC Thesaurus Topics

  • Accuracy
  • Computer Languages
  • Computer Science
  • Efficiency
  • Engineering
  • Formal Languages
  • Language
  • Natural Language Understanding
  • Natural Languages
  • Statistics
  • Theoretical Computer Science

Readers

  • Computational Linguistics
  • Theoretical Analysis