Clustering, Dimensionality Reduction, and Side Information

Abstract

Recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets in pattern recognition, machine learning, and data mining. Unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes. The purpose of this thesis is to study some of the open problems in two main areas of unsupervised learning, namely clustering and (unsupervised) dimensionality reduction. Instance-level constraints on objects, an example of side information, are also considered as a way to improve the clustering results. Our first contribution is a modification of the isometric feature mapping (ISOMAP) algorithm for the case where the input data, instead of being all available simultaneously, arrive sequentially from a data stream. ISOMAP is representative of a class of nonlinear dimensionality reduction algorithms that are based on the notion of a manifold. Both the standard ISOMAP and the landmark version of ISOMAP are considered. Experimental results on synthetic data as well as real-world images demonstrate that the modified algorithm can efficiently maintain an accurate low-dimensional representation of the data. We also study the problem of feature selection in model-based clustering when the number of clusters is unknown. We propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm for its estimation. By using the minimum message length (MML) model selection criterion, the saliency of irrelevant features is driven towards zero, which corresponds to performing feature selection. The use of MML can also determine the number of clusters automatically by pruning away the weak clusters. The proposed algorithm is validated on both synthetic data and data sets from the UCI machine learning repository. We have also developed a new algorithm for incorporating instance-level constraints in model-based clustering.
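
As a rough illustration of the manifold idea behind ISOMAP, the sketch below estimates geodesic distances by taking shortest paths through a k-nearest-neighbor graph. It covers only this distance-estimation step of the standard batch algorithm; it is not the streaming modification described in the abstract, and it omits the final embedding step. The function names, the choice of k, and the half-circle test data are assumptions made for illustration.

```python
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph as a dense weight matrix.

    Entry [i][j] is the Euclidean distance if i and j are neighbors,
    0.0 on the diagonal, and infinity otherwise.
    """
    n = len(points)
    weights = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        weights[i][i] = 0.0
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            d = math.dist(points[i], points[j])
            weights[i][j] = d
            weights[j][i] = d  # keep the graph undirected
    return weights


def geodesic_distances(points, k):
    """Approximate geodesic distances with all-pairs shortest paths
    (Floyd-Warshall) over the neighborhood graph."""
    g = knn_graph(points, k)
    n = len(g)
    for m in range(n):
        for i in range(n):
            g_im = g[i][m]
            for j in range(n):
                if g_im + g[m][j] < g[i][j]:
                    g[i][j] = g_im + g[m][j]
    return g


# Hypothetical test data: 21 points evenly spaced on a unit half-circle,
# a one-dimensional manifold embedded in two dimensions.
points = [
    (math.cos(math.pi * t / 20), math.sin(math.pi * t / 20))
    for t in range(21)
]
geo = geodesic_distances(points, k=2)

straight = math.dist(points[0], points[-1])  # distance through the ambient space
along = geo[0][-1]                           # distance along the curve
print(f"Euclidean: {straight:.3f}, graph geodesic: {along:.3f}")
```

On this data the graph distance between the two endpoints comes out close to the arc length of the half-circle, whereas the Euclidean distance cuts straight across it. ISOMAP feeds such graph distances into multidimensional scaling to obtain the low-dimensional representation.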

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2006
Accession Number
ADA496459

Entities

People

  • Hiu C. Law

Organizations

  • Michigan State University

Tags

Communities of Interest

  • Autonomy
  • Energy and Power Technologies
  • Space

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Automata Theory
  • Bayesian Networks
  • Computer Vision
  • Data Mining
  • Databases
  • Dimensionality Reduction
  • Factor Analysis
  • Information Science
  • Kernel Functions
  • Machine Learning
  • Network Science
  • Neural Networks
  • Pattern Recognition
  • Probabilistic Models
  • Supervised Machine Learning
  • Surveys

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Distributed Systems and Data Platform Development
  • Operations Research

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms