Exploring Dimensionality Reduction for Text Mining

Abstract

Text mining is the extraction of important information from a collection of textual data sources. For instance, text mining can be used to discover related concepts or to categorize previously unseen documents. In this age of information overload, text mining applications can potentially yield tremendous benefits to both individuals and organizations. However, the effectiveness of text mining is limited by the large volume of textual data, as well as its complex and noisy characteristics. Both of these challenges can be addressed with "dimensionality reduction" (DR). DR is the process of transforming a large amount of data into a much smaller, less noisy representation that preserves important relationships from the original data. DR techniques have been shown to effectively simplify large geometric datasets, but have yet to be adequately evaluated for textual data. This project evaluated five DR techniques (Principal Components Analysis, Multidimensional Scaling, Isomap, Locally Linear Embedding, and Laplace-Beltrami Diffusion Maps) from two distinct perspectives. First, the project measured the impact of each DR technique on automatic document classification for corpora of scientific abstracts or news articles. For each technique, the dataset was reduced, then a standard linear, quadratic, or nearest-neighbor classifier was used to assign categories to a test set of documents based upon a labeled training set. Results showed that, for any fixed number of dimensions used by the classifier, performing any kind of DR almost always improved classification accuracy compared to using the non-reduced data. Among the DR techniques, Isomap and Multidimensional Scaling were best able to reduce the data and eliminate noise, yielding improved accuracy. This suggests that these textual data sets lie primarily on a linear manifold for which the more complex non-linear techniques do not have an advantage.
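The reduce-then-classify procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the six-document corpus, the labels, and every function name below are hypothetical. Principal Components Analysis (computed here by power iteration) stands in for the five DR techniques, and a 1-nearest-neighbor rule stands in for the classifiers.

```python
import math
import random
from collections import Counter

def build_vocab(docs):
    """Sorted list of every distinct word in the training documents."""
    return sorted({w for d in docs for w in d.lower().split()})

def vectorize(doc, vocab):
    """Term-count vector over vocab; words outside vocab are ignored."""
    counts = Counter(doc.lower().split())
    return [float(counts[w]) for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pca_fit(X, k, iters=100, seed=0):
    """Return (mean, components) with the top-k principal directions,
    found by power iteration on the scatter matrix with deflation."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    R = [[x - m for x, m in zip(row, mean)] for row in X]  # centered data
    components = []
    for _comp in range(k):
        v = [rng.gauss(0.0, 1.0) for _j in range(d)]
        for _it in range(iters):
            scores = [dot(row, v) for row in R]            # R v
            v = [sum(s * row[j] for s, row in zip(scores, R))
                 for j in range(d)]                        # R^T (R v)
            norm = math.sqrt(dot(v, v))
            if norm == 0.0:
                break
            v = [x / norm for x in v]
        components.append(v)
        # Deflate: remove the found direction so the next pass finds a new one.
        R = [[x - dot(row, v) * c for x, c in zip(row, v)] for row in R]
    return mean, components

def pca_transform(x, mean, components):
    """Project one vector onto the fitted principal directions."""
    centered = [a - m for a, m in zip(x, mean)]
    return [dot(centered, c) for c in components]

def nearest_neighbor(query, points, labels):
    """Label of the training point closest to query (Euclidean distance)."""
    best = min(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return labels[best]

# Hypothetical toy corpus: three "science" and three "news" training documents.
train_docs = [
    "gene protein cell", "protein cell membrane", "gene cell dna",
    "election vote senate", "vote senate bill", "election senate campaign",
]
train_labels = ["science"] * 3 + ["news"] * 3
test_docs = ["cell gene protein dna", "senate vote campaign"]

vocab = build_vocab(train_docs)
X_train = [vectorize(d, vocab) for d in train_docs]
mean, comps = pca_fit(X_train, k=2)          # reduce 10 term counts to 2 dims
Z_train = [pca_transform(x, mean, comps) for x in X_train]
predictions = [
    nearest_neighbor(pca_transform(vectorize(d, vocab), mean, comps),
                     Z_train, train_labels)
    for d in test_docs
]
print(predictions)
```

The projection is fitted on the training documents only, and test documents are mapped through that same projection before classification, so the test set has no influence on the reduced representation.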

Document Details

Document Type
Technical Report
Publication Date
May 04, 2007
Accession Number
ADA473266

Entities

People

  • David G. Underhill

Organizations

  • United States Naval Academy

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Accuracy
  • Biological Sciences
  • Computer Science
  • Computers
  • Data Sets
  • Dimensionality Reduction
  • Earth Sciences
  • Machine Learning
  • Natural Language Processing
  • Prostate Cancer
  • Signal Processing
  • Standards
  • Test Sets
  • Text Mining
  • Training
  • Two Dimensional
  • United States Naval Academy

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Neural Network Machine Learning
  • Regression Analysis