Tuning a Major Part of a Clustering Algorithm.
Abstract
Most proposals for clustering algorithms have been based on introspection Few proposed algorithms have had their performance studied. This approach involves (a) striving to avoid comparing distances on remote parts of the data (because metrics deserve only minimum trust), and (b) using a stochastically defined test bed to measure, and where possible understand, the performance of an evolving algorithm, with the intent of using our understanding to modify it in such a way as to improve its performance. This test bed involves 3 circular Gaussian samples, of size 50 each, centered at the vertices of an equilateral triangle of side t(sigma). Its use assumes that a 3-group answer is being sought. Thus the only concerned is with a part of the clustering process. Our early algorithms begin to misbehave in the range 5 < or = t < or = 7. Our successive steps of improvement work at smaller and smaller t. The last version we have tried still performs usefully (median misclassification about 16%) at t = 2.7, where knowledge of three populations would only let us hold misclassification to a median of 13.3%. Comparison with a Gaussian maximum likelihood algorithm on the same set of triple samples shows only slightly better performance than for our algorithm.
Document Details
- Document Type
- Technical Report
- Publication Date
- Feb 01, 1988
- Accession Number
- ADA191446
Entities
People
- John W. Tukey
- Katherine M. Hansen
Organizations
- Princeton University