An Evaluation of Statistical Approaches to Text Categorization

Abstract

This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven of the fourteen methods remain. A k-nearest neighbor (kNN) classifier was chosen as the performance baseline on several collections; on each collection, the performance scores of the other methods were normalized using the score of kNN. This provides a common basis for a global observation on methods whose results are only available on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had relatively poor results compared to the other learning methods. kNN is the only learning method that has scaled to the full domain of MEDLINE categories, showing graceful behavior when the target space grows from the level of one hundred categories to a level of tens of thousands.
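
The baseline normalization described in the abstract can be sketched as follows. This is a minimal illustration, not the report's procedure: it assumes normalization means dividing each method's score by the kNN score on the same collection, and the function name, collection names, method names, and numbers are all hypothetical placeholders rather than results from the report.

```python
def normalize_by_baseline(scores, baseline="kNN"):
    """Divide each method's score by the baseline's score on the same collection.

    `scores` maps collection name -> {method name -> score}.
    Assumption: normalization is a simple ratio to the baseline score.
    """
    normalized = {}
    for collection, by_method in scores.items():
        base = by_method.get(baseline)
        if not base:
            # No usable baseline score on this collection, so skip it.
            continue
        normalized[collection] = {
            method: score / base for method, score in by_method.items()
        }
    return normalized


# Made-up placeholder numbers, NOT results from the report.
example_scores = {
    "collection_A": {"kNN": 0.80, "method_X": 0.60},
    "collection_B": {"kNN": 0.50, "method_Y": 0.55},
}
example_normalized = normalize_by_baseline(example_scores)
```

Under this assumption the baseline maps to 1.0 on every collection, so methods evaluated on different collections can be compared by how far they sit above or below 1.0.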

Document Details

Document Type
Technical Report
Publication Date
Apr 10, 1997
Accession Number
ADA327980

Entities

People

  • Yiming Yang

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Energy and Power Technologies
  • Human Systems

DTIC Thesaurus Topics

  • Algorithms
  • Analysis Of Variance
  • Classification
  • Computer Science
  • Computers
  • Data Sets
  • Diseases And Disorders
  • Errors
  • Expert Systems
  • Frequency
  • Heart Diseases
  • Inclusions
  • Information Science
  • Machine Learning
  • Neural Networks
  • Precision
  • Test Sets

Readers

  • Neural Network Machine Learning
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks
  • Space