Weight Adjustment Schemes for a Centroid Based Classifier

Abstract

In recent years we have seen tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Automatic text categorization, the task of assigning text documents to pre-specified classes (topics or themes), is an important task that can help both in organizing and in finding information in these huge resources. Similarity-based categorization algorithms such as k-nearest neighbor, generalized instance set and centroid-based classification have been shown to be very effective in document categorization. A major drawback of these algorithms is that they use all features when computing similarities, yet in many document data sets only a small fraction of the total vocabulary may be useful for categorizing documents. A possible approach to overcoming this problem is to learn weights for the different features (words, in document data sets). In this report we present two fast iterative feature weight adjustment algorithms for the linear-complexity centroid-based classification algorithm. Our algorithms use a measure of the discriminating power of each term to gradually adjust the weights of all features concurrently. We experimentally evaluate our algorithms on the Reuters-21578 and OHSUMED document collections and compare them against Rocchio, Widrow-Hoff and SVM. We also evaluate their classification accuracy on data sets with multiple classes, where we compare them against traditional classifiers such as k-NN, naive Bayesian and C4.5. Experiments show that feature weight adjustment improves the performance of the centroid-based classifier by 2-5%, substantially outperforms Rocchio and Widrow-Hoff, and is competitive with SVM. These algorithms also outperform traditional classifiers such as k-NN, naive Bayesian and C4.5 on the multi-class text document data sets.
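To make the setting concrete, the following is a minimal Python sketch of a plain centroid-based classifier with an optional per-feature weight vector. It is not the weight adjustment algorithm from the report, which the abstract does not specify. The function names, the toy term counts, and the hand-picked weights are all assumptions made for illustration. Each class is represented by the normalized mean of its weighted, unit-length document vectors, and a new document is assigned to the class whose centroid has the highest cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def apply_weights(vec, weights):
    """Multiply each term frequency by its feature weight."""
    return [v * w for v, w in zip(vec, weights)]


def train_centroids(docs, labels, weights):
    """Return one unit-length centroid per class from weighted documents."""
    sums, counts = {}, {}
    for doc, label in zip(docs, labels):
        unit = normalize(apply_weights(doc, weights))
        if label not in sums:
            sums[label] = [0.0] * len(unit)
            counts[label] = 0
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
        counts[label] += 1
    return {
        label: normalize([s / counts[label] for s in total])
        for label, total in sums.items()
    }


def classify(doc, centroids, weights):
    """Assign the class whose centroid is most cosine-similar to the document."""
    unit = normalize(apply_weights(doc, weights))

    def cosine(label):
        return sum(a * b for a, b in zip(unit, centroids[label]))

    return max(centroids, key=cosine)


# Hypothetical toy data: columns are counts of three terms.
# Term 2 occurs heavily in both classes, so it carries little class information.
docs = [[2, 0, 5], [3, 0, 4], [0, 2, 5], [0, 3, 4]]
labels = ["grain", "grain", "finance", "finance"]

uniform = [1.0, 1.0, 1.0]      # every feature counts equally
downweighted = [1.0, 1.0, 0.1]  # assumed, hand-picked weights for illustration

centroids_uniform = train_centroids(docs, labels, uniform)
centroids_weighted = train_centroids(docs, labels, downweighted)

prediction_uniform = classify([1, 0, 3], centroids_uniform, uniform)
prediction_weighted = classify([1, 0, 3], centroids_weighted, downweighted)
```

In this sketch the weights are fixed by hand. In the report they are learned iteratively from a measure of each term's discriminating power, so the weight vector here only marks where such a scheme would plug in.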

Document Details

Document Type
Technical Report
Publication Date
May 30, 2000
Accession Number
ADA439629

Entities

People

  • George Karypis
  • Shrikanth Shankar

Organizations

  • University of Minnesota

Tags

Communities of Interest

  • Ground and Sea Platforms

DTIC Thesaurus Topics

  • Abstracts
  • Accuracy
  • Algorithms
  • Applied Computer Science
  • Artificial Intelligence
  • Classification
  • Clustering
  • Computational Complexity
  • Computer Science
  • Data Sets
  • Feature Selection
  • Frequency
  • Information Retrieval
  • Iterations
  • Machine Learning
  • Supervised Machine Learning
  • Vector Spaces

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Distributed Systems and Data Platform Development
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Neural Networks