The effectiveness of lloyd-type methods for the k-means problem

Abstract

We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

Document Details

Document Type
Pub Defense Publication
Publication Date
Dec 01, 2012
Source ID
10.1145/2395116.2395117

Entities

People

  • Chaitanya Swamy
  • Leonard J. Schulman
  • Rafail Ostrovsky
  • Yuval Rabani

Organizations

  • Bulgarian Science Fund
  • California Institute of Technology
  • Hebrew University of Jerusalem
  • Israel Science Foundation
  • National Science Foundation
  • Natural Sciences and Engineering Research Council
  • Office of Naval Research
  • University of California, Los Angeles
  • University of Waterloo

Tags

Fields of Study

  • Computer science

Readers

  • Economics
  • Neural Network Machine Learning.
  • Operations Research