Density Biased Sampling: An Improved Method for Data Mining and Clustering

Abstract

Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice and also frequently criticized because it will miss small clusters. Many natural phenomena are known to follow Zipf's distribution and the inability of uniform sampling to find small clusters is of practical concern. Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions. Our proposed method scales linearly and out performs uniform samples when clustering realistic data sets.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
May 26, 1999
Accession Number
ADA366202

Entities

People

  • Christopher R. Palmer
  • Christos Faloutsos

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • C4I

DTIC Thesaurus Topics

  • Algorithms
  • Automated Text Summarization
  • Biological Sciences
  • Computer Science
  • Data Mining
  • Data Sets
  • Databases
  • Gaussian Distributions
  • Hash Tables
  • Information Retrieval
  • Information Science
  • Network Science
  • Probability
  • Sampling
  • Statistical Algorithms
  • Statistical Samples
  • Statistical Sampling

Fields of Study

  • Computer science

Readers

  • Approximation Theory.
  • Neural Network Machine Learning.
  • Theoretical Analysis.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms