Enhancing Text Analysis via Dimensionality Reduction

Abstract

Many applications require analyzing vast amounts of textual data, but the size and inherent noise of such data can make processing very challenging. One approach to these issues is to mathematically reduce the data so as to represent each document using only a few dimensions. Techniques for performing such "dimensionality reduction" (DR) have been well studied for geometric and numerical data, but have more rarely been applied to text. In this paper, we examine the impact of five DR techniques on the accuracy of two supervised classifiers on three textual sources. This task mirrors important real-world problems, such as classifying web pages or scientific articles. In addition, the accuracy serves as a proxy measure for how well each DR technique preserves the inter-document relationships while vastly reducing the size of the data, which in turn facilitates more sophisticated analysis. We show that, for a fixed number of dimensions, DR can be very successful at improving accuracy compared to using the original words as features. Surprisingly, we also find that one of the simplest DR techniques, Multidimensional Scaling (MDS), is among the most effective. This suggests that textual data may often lie upon a linear manifold, where the more complex non-linear DR techniques do not have an advantage.
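
To make the idea of representing each document with only a few dimensions concrete, the following is a minimal sketch of classical (metric) MDS in Python with NumPy. It is an illustration only and not the report's implementation: the toy term-count matrix, the choice of Euclidean distance between documents, and the function name classical_mds are all assumptions introduced here.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a pairwise distance matrix.

    Hypothetical helper for illustration; not taken from the report.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to recover a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy term-count matrix (made-up data): rows are documents, columns are words.
counts = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 0],
], dtype=float)

# Pairwise Euclidean distances between documents (an assumed distance measure).
diffs = counts[:, None, :] - counts[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Each document is now described by 2 numbers instead of 5 word counts.
embedding = classical_mds(distances, n_components=2)
print(embedding.shape)
```

The rows of embedding could then be passed to any supervised classifier in place of the original word counts. With Euclidean input distances, classical MDS is a linear projection, which is consistent with the abstract's description of MDS as one of the simplest DR techniques.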

Document Details

Document Type
Technical Report
Publication Date
Aug 01, 2007
Accession Number
ADA479796

Entities

People

  • David G. Underhill
  • David J. Marchette
  • Jeffrey L. Solka
  • Luke K. McDowell

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Accuracy
  • Algorithms
  • Applied Computer Science
  • Artificial Intelligence
  • Biological Sciences
  • Classification
  • Computer Science
  • Data Science
  • Data Sets
  • Dimensionality Reduction
  • Feature Extraction
  • Feature Selection
  • Language
  • Machine Learning
  • Natural Language Processing
  • Natural Languages
  • Text Mining

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Finite Element Method (FEM) for solving Partial Differential Equations (PDEs)
  • Systems Analysis and Design