THE MICROSTATISTICS OF TEXT
Abstract
This paper is a reappraisal of the role of statistics in text analysis. Current inhibiting influences in the use of statistics are discussed. The question of descriptive vs. predictive statistics is explored at some length. A distinction between macrostatistics and microstatistics is made, with the implication that the former should be used in describing libraries whereas the latter should be used in describing written language. The second section of the paper pictures a relationship between the probability of occurrence of a word or word group in text and the cognitive effect of such a word or word group. This relationship is then illustrated through statistical data on word pairs; statistics of pairs which are directly linked in a sentence-structure tree are compared to statistics of pairs which, though the words are adjacent in text, are not directly linked in such a tree. This study of statistics as a function of sentence structure is then extended to units of text larger than a word pair. The final section discusses the problem of selecting and displaying content-indicative word groups in condensed representations of documents. It explains why the statistical approach, by itself or in conjunction with other techniques, is unavoidable in a problem such as automatic abstracting, and illustrates the perils faced by some non-statistical methods which have been talked about in the recent literature.
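As a rough illustration of the kind of word-pair statistic the abstract mentions, the following minimal sketch counts adjacent word pairs and compares each pair's observed relative frequency with the frequency expected if the two words occurred independently. It is not the report's method: the function name `adjacent_pair_stats`, the sample sentence, and the independence baseline are all assumptions made for illustration, and the sketch does not attempt the report's distinction between pairs that are linked in a sentence-structure tree and pairs that are merely adjacent, which would require a parse of each sentence.

```python
from collections import Counter


def adjacent_pair_stats(tokens):
    """Map each adjacent pair (w1, w2) to (observed, expected).

    observed: relative frequency of the pair among all adjacent pairs.
    expected: product of the two words' relative frequencies, i.e. the
              pair frequency under an (assumed) independence baseline.
    """
    n = len(tokens)
    if n < 2:
        return {}
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_pairs = n - 1
    stats = {}
    for (w1, w2), count in pair_counts.items():
        observed = count / n_pairs
        expected = (word_counts[w1] / n) * (word_counts[w2] / n)
        stats[(w1, w2)] = (observed, expected)
    return stats


# Hypothetical sample text, chosen only to exercise the function.
tokens = "the cat sat on the mat and the cat slept".split()
stats = adjacent_pair_stats(tokens)
```

A pair whose observed frequency is well above its expected frequency is a candidate for a meaningful word group; on a text this small the numbers carry no statistical weight.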
Document Details
- Document Type: Technical Report
- Publication Date: Feb 01, 1963
- Accession Number: AD0401445
Entities
People
- Lauren B. Doyle
Organizations
- System Development Corporation