Accelerating Exact k-means Algorithms with Geometric Reasoning

Abstract

We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. Then an analysis of the geometry of the current cluster centers results in great reduction of the work needed to update the centers. Our algorithms behave exactly as the traditional k-means algorithm. Proofs of correctness are included. The kd-tree can also be used to initialize the k-means starting centers efficiently. Our algorithms can be easily extended to provide fast ways of computing the error of a given cluster assignment regardless of the method in which those clusters were obtained. We also show how to use them in a setting which allows approximate clustering results, with the benefit of running faster. We have implemented and tested our algorithms on both real and simulated data. Results show a speedup factor of up to 170 on real astrophysical data, and superiority over the naive algorithm on simulated data in up to 5 dimensions. Our algorithms scale well with respect to the number of points and number of centers allowing for clustering with tens of thousands of centers.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2000
Accession Number
ADA374582

Entities

People

  • Andrew Moore
  • Dan Pelleg

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • C4I

DTIC Thesaurus Topics

  • Algorithms
  • Boundaries
  • Clustering
  • Computations
  • Computer Science
  • Data Sets
  • Distortion
  • Gaussian Distributions
  • Geometry
  • Iterations
  • Linear Programming
  • Sampling
  • Scale Models
  • Statistical Samples
  • Statistics
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Computer Vision.
  • Distributed Systems and Data Platform Development

Technology Areas

  • Space