Concept Indexing: A Fast Dimensionality Reduction Algorithm With Applications to Document Retrieval and Categorization

Abstract

In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can efficiently categorize and retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limits its applicability. In this paper we present a fast dimensionality reduction algorithm, called concept indexing (CI) that is equally effective for unsupervised and supervised dimensionality reduction. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time. Moreover, when CI is used to compute the dimensionality reduction in a supervised setting, it greatly improves the performance of traditional classification algorithms such as C4.5 and kNN.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Mar 06, 2000
Accession Number
ADA439511

Entities

People

  • Euihong Han
  • George Karypis

Organizations

  • University of Minnesota

Tags

Communities of Interest

  • C4I

DTIC Thesaurus Topics

  • Abstracts
  • Accuracy
  • Algorithms
  • Artificial Intelligence
  • Classification
  • Clustering
  • Computer Science
  • Data Sets
  • Dimensionality Reduction
  • Engineering
  • Information Operations
  • Machine Learning
  • Mathematics
  • Military Research
  • Minnesota
  • Natural Language Processing
  • World Wide Web

Fields of Study

  • Computer science

Readers

  • Computer Vision.
  • Database Systems and Applications

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms
  • Space