Statistical Methods for Technical Document Retrieval
Abstract
The RADC Automatic Document Classification On-Line (RADCOL) system is a tool for testing various statistical procedures for document analysis and retrieval, and for the design of operational systems. This report describes experiments which used the RADCOL system; it was found, as had been predicted, that procedures for clustering word stems did not provide substantial savings in space and time, and that an unclustered thesaurus gave improved retrieval capabilities. Three new versions of the system were implemented, with weights of 0.0, 0.5, and 1.0 assigned to identity correlations (correlations of word stems with themselves). Because of superior performance of the system using 1.0 correlations, a simplified version of the retrieval technique was recommended for use with science and technology abstracts. In the simplified system, automatic thesaurus generation would be eliminated, and a large technical vocabulary would be used. Retrievals would use direct correlations between queries and documents. These experiments are believed to be the most comprehensive series of tests of statistical retrieval methods performed on a data base of realistic size. Further experimentation is recommended to determine the applicability of statistical methods to other types of intelligence data bases and user requirements.
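The abstract contrasts retrieval through a stem-to-stem thesaurus, with identity correlations weighted 0.0, 0.5, or 1.0, against direct correlation between queries and documents. This record gives no formulas, so the following is only a minimal sketch under stated assumptions: cosine similarity over word-stem counts stands in for "correlation", the thesaurus is a plain dictionary of stem-to-stem weights, and the names `expand_query` and `cosine` are hypothetical and not taken from RADCOL.

```python
import math
from collections import Counter


def expand_query(query, thesaurus, identity_weight):
    """Spread each query stem's weight over its correlated stems.

    Assumption: the correlation of a stem with itself is `identity_weight`
    (0.0, 0.5, or 1.0 in the abstract); all other correlations come from
    the `thesaurus` mapping stem -> {other_stem: correlation}.
    """
    expanded = Counter()
    for stem, weight in query.items():
        expanded[stem] += identity_weight * weight
        for other, corr in thesaurus.get(stem, {}).items():
            if other != stem:
                expanded[other] += corr * weight
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse stem-weight vectors (assumed measure)."""
    dot = sum(a[s] * b[s] for s in a if s in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative data only; not drawn from the report.
query = Counter({"retriev": 1, "cluster": 1})
document = Counter({"retriev": 2, "statist": 1})
thesaurus = {"cluster": {"statist": 0.5}}

# With no thesaurus and identity weight 1.0, the expanded query equals the
# original query, so scoring reduces to a direct query-document correlation.
direct_score = cosine(expand_query(query, {}, 1.0), document)

# With identity weight 0.0, only thesaurus-related stems can match.
thesaurus_only_score = cosine(expand_query(query, thesaurus, 0.0), document)
```

In this sketch, setting the identity weight to 1.0 and dropping the thesaurus leaves the query unchanged, which is one way to read the simplified system the abstract recommends.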
Document Details
- Document Type: Technical Report
- Publication Date: Jun 01, 1977
- Accession Number: ADA041845
Entities
People
- Harry M. Hersh
- James R. Wilson
- John M. Morris
- Katherine J. Morris