THE MICROSTATISTICS OF TEXT

Abstract

This paper is a reappraisal of the role of statistics in text analysis. Current inhibiting influences in the use of statistics are discussed. The question of descriptive vs. predictive statistics is explored at some length. A distinction between macrostatistics and microstatistics is made, with the implication that the former should be used in describing libraries whereas the latter should be used in describing written language. The second section of the paper pictures a relationship between the probability of occurrence of a word or word group in text and the cognitive effect of such a word or word group. This relationship is then illustrated through statistical data on word pairs; statistics of pairs which are directly linked in a sentence-structure tree are compared to statistics of pairs which, though the words are adjacent in text, are not directly linked in such a tree. This study of statistics as a function of sentence structure is then extended to units of text larger than a word pair. The final section discusses the problem of selecting and displaying content-indicative word groups in condensed representations of documents. It explains why the statistical approach, by itself or in conjunction with other techniques, is unavoidable in a problem such as automatic abstracting, and illustrates the perils faced by some non-statistical methods which have been talked about in the recent literature.

Document Details

Document Type: Technical Report
Publication Date: Feb 01, 1963
Accession Number: AD0401445

Entities

People

Lauren B. Doyle

Organizations

System Development Corporation
