A Scale Independent, Noise Resistant Dissimilarity for Tree Based Clustering of Mixed Data

Abstract

Clustering techniques divide observations into groups. Current techniques usually rely on measurements of dissimilarities between pairs of observations, between pairs of clusters, and between an observation and a cluster. For numeric variables, these dissimilarity measurements often depend on the scaling of the variables, are changed by monotonic transformations, and do not provide for selection of "important" variables. In our scheme, we fit a set of regression or classification trees, with each variable acting in turn as the "response" variable. Points are "close" to one another if they tend to appear in the same leaves of these trees. Trees with poor predictive power are discarded. Therefore, "noise" variables will often appear in none of the trees and have no effect on the clustering. Because our technique uses trees, the dissimilarities are unaffected by linear transformations of the numeric variables and resistant to monotonic ones and to outliers. Categorical variables are included automatically, and missing values are handled in a natural way. We demonstrate the performance of this technique by using these dissimilarities to cluster some well-known data sets to which noise has been added.
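
The scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices here are assumptions that the abstract does not specify: only numeric variables are handled, the trees are small greedy regression trees, "poor predictive power" is taken to mean an in-sample R-squared below a hypothetical `min_r2` threshold, and the dissimilarity between two points is the fraction of retained trees that place them in different leaves. The function names (`sse`, `grow`, `tree_dissimilarity`) and the toy data are invented for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grow(rows, idx, target, predictors, depth, min_leaf, leaves):
    """Greedy regression tree on `target`; appends each leaf's row indices
    to `leaves` and returns the tree's residual sum of squares."""
    y = [rows[i][target] for i in idx]
    best = None
    if depth > 0:
        for p in predictors:
            order = sorted(idx, key=lambda i: rows[i][p])
            for k in range(min_leaf, len(order) - min_leaf + 1):
                if rows[order[k - 1]][p] == rows[order[k]][p]:
                    continue  # cannot split between tied predictor values
                left, right = order[:k], order[k:]
                cost = (sse([rows[i][target] for i in left])
                        + sse([rows[i][target] for i in right]))
                if best is None or cost < best[0]:
                    best = (cost, left, right)
    if best is None or best[0] >= sse(y):
        leaves.append(idx)  # no useful split: this node becomes a leaf
        return sse(y)
    return sum(grow(rows, part, target, predictors, depth - 1, min_leaf, leaves)
               for part in best[1:])


def tree_dissimilarity(rows, depth=2, min_leaf=2, min_r2=0.5):
    """Fit one tree per variable (that variable is the response), drop trees
    whose in-sample R-squared is below min_r2 (an assumed criterion), and
    return (matrix of fraction-of-trees-in-different-leaves, trees kept)."""
    n, p = len(rows), len(rows[0])
    kept = []  # one leaf-membership vector per retained tree
    for target in range(p):
        predictors = [j for j in range(p) if j != target]
        leaves = []
        resid = grow(rows, list(range(n)), target, predictors,
                     depth, min_leaf, leaves)
        total = sse([r[target] for r in rows])
        r2 = 1.0 - resid / total if total > 0 else 0.0
        if r2 >= min_r2:
            leaf_of = [0] * n
            for leaf_id, members in enumerate(leaves):
                for i in members:
                    leaf_of[i] = leaf_id
            kept.append(leaf_of)
    d = [[0.0] * n for _ in range(n)]
    if kept:
        for i in range(n):
            for j in range(n):
                d[i][j] = sum(t[i] != t[j] for t in kept) / len(kept)
    return d, len(kept)


# Toy data (invented): columns 0 and 1 separate two groups, column 2 is noise.
rows = [
    [1.0, 10.0, 5.0], [2.0, 11.0, 1.0], [1.5, 10.5, 9.0], [1.2, 10.2, 3.0],
    [8.0, 20.0, 4.0], [9.0, 21.0, 8.0], [8.5, 20.5, 2.0], [8.2, 20.2, 6.0],
]
d, n_kept = tree_dissimilarity(rows, depth=1)

# Rescaling a variable linearly should leave the dissimilarities unchanged,
# because splits depend only on the ordering of predictor values.
scaled = [[r[0] * 1000.0, r[1], r[2]] for r in rows]
d_scaled, _ = tree_dissimilarity(scaled, depth=1)

print(n_kept, d[0][3], d[0][4], d_scaled == d)
```

On this toy data with single-split trees, the trees for the two informative columns are retained and the tree for the noise column falls below the threshold, so the noise column plays no part in the dissimilarities. The resulting matrix could be passed to any clustering routine that accepts precomputed dissimilarities. A more faithful version would judge each tree by cross-validated error rather than in-sample fit, and would need to handle categorical variables and missing values, which this sketch omits.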

Document Details

Document Type
Technical Report
Publication Date
Apr 19, 2016
Accession Number
AD1060257

Entities

People

  • Lyn R. Whitaker
  • Samuel E. Buttrey

Organizations

  • Naval Postgraduate School

Tags

Communities of Interest

  • Autonomy
  • Biomedical
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Biomedical Research
  • Classification
  • Clustering
  • Coefficients
  • Computations
  • Computer Vision
  • Data Sets
  • Department Of Defense
  • Governments
  • Information Operations
  • Instructions
  • Measurement
  • Numbers
  • Observation
  • Operations Research

Readers

  • Forest Ecology
  • Regression Analysis