Incremental Model-Based Clustering for Large Datasets With Small Clusters

Abstract

Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.
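
The incremental procedure described in the abstract can be illustrated with a minimal sketch. This is not the report's implementation: it assumes one-dimensional data and Gaussian components, starts from a single Gaussian fitted to the sample instead of selecting a model on the sample, and uses BIC (larger is better) as an assumed stopping rule. All function names and the parameters `sample_size`, `worst_frac`, `max_clusters`, and `MIN_VAR` are hypothetical.

```python
import math
import random

MIN_VAR = 1e-3  # assumed variance floor so components cannot collapse

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_logpdf(x, params):
    # params is a list of (weight, mean, variance) tuples
    return math.log(sum(w * normal_pdf(x, m, v) for w, m, v in params) + 1e-300)

def em(data, params, iters=40):
    """Run a fixed number of EM iterations for a 1-D Gaussian mixture."""
    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in params]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component
        updated = []
        for k, old in enumerate(params):
            nk = sum(r[k] for r in resp)
            if nk < 1e-8:  # empty component: keep its previous parameters
                updated.append(old)
                continue
            mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
            updated.append((nk / len(data), mean, max(var, MIN_VAR)))
        params = updated
    return params

def bic(data, params):
    loglik = sum(mixture_logpdf(x, params) for x in data)
    n_free = 3 * len(params) - 1  # means, variances, and free weights
    return 2 * loglik - n_free * math.log(len(data))

def incremental_fit(data, sample_size=200, worst_frac=0.01, max_clusters=6, seed=0):
    rng = random.Random(seed)
    # Step 1: fit a starting model to a random sample (here a single Gaussian)
    sample = rng.sample(data, min(sample_size, len(data)))
    mean = sum(sample) / len(sample)
    var = max(sum((x - mean) ** 2 for x in sample) / len(sample), MIN_VAR)
    # Step 2: extend the sample model to the full dataset with EM iterations
    params = em(data, [(1.0, mean, var)])
    best = bic(data, params)
    # Step 3: add clusters one at a time, seeded by the worst-fit observations
    while len(params) < max_clusters:
        n_worst = max(2, int(worst_frac * len(data)))
        worst = sorted(data, key=lambda x: mixture_logpdf(x, params))[:n_worst]
        w_new = n_worst / len(data)
        m_new = sum(worst) / n_worst
        v_new = max(sum((x - m_new) ** 2 for x in worst) / n_worst, MIN_VAR)
        candidate = [(w * (1 - w_new), m, v) for w, m, v in params]
        candidate.append((w_new, m_new, v_new))
        candidate = em(data, candidate)
        score = bic(data, candidate)
        if score <= best:  # the extra cluster did not improve BIC: stop
            break
        params, best = candidate, score
    return params

# Synthetic example: one large cluster plus a small one holding about 2% of the data
demo_rng = random.Random(1)
data = [demo_rng.gauss(0, 1) for _ in range(2000)]
data += [demo_rng.gauss(10, 0.5) for _ in range(40)]
for weight, mean, var in incremental_fit(data):
    print(f"weight={weight:.3f} mean={mean:.2f} var={var:.3f}")
```

Seeding each new component with the lowest-density observations is what lets the sketch pick up the small cluster even though the starting model was fitted to a sample in which that cluster is barely represented.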

Document Details

Document Type
Technical Report
Publication Date
Dec 10, 2003
Accession Number
ADA459790

Entities

People

  • Adrian Raftery
  • Chris Fraley
  • Ron Wehrens

Organizations

  • University of Washington

Tags

Communities of Interest

  • Advanced Electronics
  • Biomedical

DTIC Thesaurus Topics

  • Algorithms
  • Artificial Intelligence
  • Clustering
  • Computational Science
  • Computations
  • Data Mining
  • Data Science
  • Data Sets
  • Databases
  • Information Processing
  • Information Science
  • Information Systems
  • Machine Learning
  • Probability
  • Sampling
  • Standards
  • Statistics

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Computer Vision
  • Distributed Systems and Data Platform Development