DISCRIMINANT ANALYSIS FOR CONTENT CLASSIFICATION

Abstract

A series of experiments was performed to investigate the effectiveness and utility of automatically classifying documents through the use of multiple discriminant functions. Classification is accomplished by computing the distance from the mean vector of each category to the vector of observed frequencies of a document and assigning the document to the category having the highest probability. Data concerning the effect of the principal classification parameters on classification performance is reported, based on a data base of approximately 2700 abstracts from the solid state physics field. The parameters studied were the number of sample documents required to define a category, the length of documents, the interrelationship of the number of sample documents and their lengths, the relation of the number of word types in a document to the number of categories assigned to it, levels in a structure, homogeneity of categories, and performance measures. A higher performance level was obtained when samples of 140 documents were used to define each category than with samples of 35 and 70 documents. Classification results obtained on independent test sets of documents ranged from 73 to 92 percent. The test sets contained 419 and 1333 documents. Results are also reported in terms of Swets' effectiveness measure and Cleverdon's ratios of relevance, recall and precision.
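
The classification rule described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the vocabulary, category names, and sample texts are invented for illustration, and the probability model (equal prior probabilities and an identity covariance matrix, so that the distance reduces to squared Euclidean distance) is a simplifying assumption. A full discriminant analysis would normally estimate a covariance matrix from the sample documents.

```python
import math
from collections import Counter


def frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance (assumes identity covariance)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def category_probabilities(doc_vec, means):
    """Probability of each category given the document vector.

    Assumption: equal priors and Gaussian categories with identity
    covariance, so probability is proportional to exp(-distance / 2).
    """
    distances = {c: squared_distance(doc_vec, m) for c, m in means.items()}
    d_min = min(distances.values())  # subtract the minimum for stability
    weights = {c: math.exp(-0.5 * (d - d_min)) for c, d in distances.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def classify(doc_vec, means):
    """Assign the document to the category with the highest probability."""
    probs = category_probabilities(doc_vec, means)
    return max(probs, key=probs.get)


# Hypothetical toy data: two categories, each defined by sample documents.
VOCAB = ["lattice", "phonon", "magnetic", "spin"]
SAMPLES = {
    "lattice_dynamics": ["lattice phonon phonon", "phonon lattice lattice"],
    "magnetism": ["magnetic spin spin", "spin magnetic magnetic"],
}

# The mean vector of each category is computed from its sample documents.
MEANS = {
    category: mean_vector([frequency_vector(t, VOCAB) for t in texts])
    for category, texts in SAMPLES.items()
}

doc = frequency_vector("spin waves in a magnetic lattice", VOCAB)
label = classify(doc, MEANS)
print(label)
```

With this toy input the document lies closer to the mean vector of the hypothetical "magnetism" category, so the sketch prints magnetism. Under these assumptions the highest probability always corresponds to the smallest distance, which is why the abstract can describe the rule in terms of either quantity.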

Document Details

Document Type
Technical Report
Publication Date
Feb 01, 1966
Accession Number
AD0630127

Entities

People

  • John H. Williams Jr.

Organizations

  • International Business Machines Corporation (Armonk, NY)

Tags

Communities of Interest

  • Air Platforms
  • Autonomy
  • Human Systems
  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Abstracts
  • Accuracy
  • Computer Programs
  • Contracts
  • Databases
  • Discriminant Analysis
  • Frequency
  • Homogeneity
  • Information Processing
  • Information Retrieval
  • Information Science
  • Physics
  • Probability
  • Solid State Physics
  • Statistical Samples
  • Statistics
  • Test Sets

Readers

  • Business Analytics
  • Regression Analysis