Locally Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces (Preprint)

Abstract

Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current state-of-the-art technique to indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then indexing the reduced dimensionality space using a multidimensional index structure. The above technique, referred to as global dimensionality reduction (GDR), works well when the data set is globally correlated, i.e. most of the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. In such cases, reducing the data dimensionality using GDR causes significant loss of distance information resulting in a large number of false positives and hence a high query cost. Even when a global correlation does not exist, there may exist subsets of data that are locally correlated. In this paper, we propose a technique called Local Dimensionality Reduction (LDR) that tries to find local correlations in the data and performs dimensionality reduction on the locally correlated clusters of data individually. We develop an index structure that exploits the correlated clusters to efficiently support point, range and k-nearest neighbor queries over high dimensional datasets. Our experiments on synthetic as well as real-life datasets show that our technique (1) reduces the dimensionality of the data with significantly lower loss in distance information compared to GDR and (2) significantly outperforms the GDR, original space indexing and linear scan techniques in terms of the query cost for both synthetic and real-life datasets.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2000
Accession Number
ADA466132

Entities

People

  • Kaushik Chakrabarti
  • Sharad Mehrotra

Organizations

  • University of Illinois Urbana–Champaign

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Algorithms
  • Clustering
  • Correlation Analysis
  • Costs
  • Data Analysis
  • Data Mining
  • Data Science
  • Data Sets
  • Databases
  • Dimensionality Reduction
  • Eigenvalues
  • Errors
  • Factor Analysis
  • Information Science
  • Precision
  • Statistical Samples
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning.
  • Regression Analysis.

Technology Areas

  • Space