THE MICROSTATISTICS OF TEXT

Abstract

This paper is a reappraisal of the role of statistics in text analysis. Current inhibiting influences in the use of statistics are discussed. The question of descriptive vs. predictive statistics is explored at some length. A distinction between macrostatistics and microstatistics is made, with the implication that the former should be used in describing libraries whereas the latter should be used in describing written language. The second section of the paper pictures a relationship between the probability of occurrence of a word or word group in text and the cognitive effect of such a word or word group. This relationship is then illustrated through statistical data on word pairs; statistics of pairs which are directly linked in a sentence-structure tree are compared to statistics of pairs which, though the words are adjacent in text, are not directly linked in such a tree. This study of statistics as a function of sentence structure is then extended to units of text larger than a word pair. The final section discusses the problem of selecting and displaying content-indicative word groups in condensed representations of documents. It explains why the statistical approach, by itself or in conjunction with other techniques, is unavoidable in a problem such as automatic abstracting, and illustrates the perils faced by some non-statistical methods which have been talked about in the recent literature.

Document Details

Document Type
Technical Report
Publication Date
Feb 01, 1963
Accession Number
AD0401445

Entities

People

  • Lauren B. Doyle

Organizations

  • System Development Corporation

Tags

Communities of Interest

  • Biomedical
  • Space

DTIC Thesaurus Topics

  • Brain
  • Brain Injuries
  • Computer Programs
  • Data Processing
  • Dictionaries
  • Gravitational Fields
  • Human Behavior
  • Information Processing
  • Information Retrieval
  • Information Science
  • Language
  • Linguistics
  • Psychological Tests
  • Psychology
  • Statistical Analysis
  • Statistical Samples
  • Thinking

Fields of Study

  • Mathematics

Readers

  • Computational Linguistics
  • Regression Analysis