Model-Based Clustering and Data Transformations for Gene Expression Data

Abstract

Clustering is a useful exploratory technique for the analysis of gene expression data, and many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. Model-based clustering assumes that the data are generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. This Gaussian mixture model has been shown to be a powerful tool for many applications. In addition, the issues of selecting a "good" clustering method and determining the "correct" number of clusters are reduced to model selection problems in the probability framework. We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the right number of clusters.
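
For readers unfamiliar with the term, the finite Gaussian mixture mentioned in the abstract can be written out as below. This is a sketch in standard textbook notation, not taken from the report itself: the symbols (G components, mixing proportions \tau_k, means \mu_k, covariances \Sigma_k, dimension d) are assumptions introduced here for illustration. The last line shows the Bayesian Information Criterion as one commonly used way to cast the choice of model and number of clusters as a model selection problem; the abstract does not name the criterion it uses, so treat that line as an assumption as well.

```latex
% Finite Gaussian mixture density for one observation y_i in R^d
% (notation assumed for illustration; not taken from the report)
p(y_i \mid \theta) = \sum_{k=1}^{G} \tau_k \, \phi(y_i \mid \mu_k, \Sigma_k),
\qquad \tau_k \ge 0, \quad \sum_{k=1}^{G} \tau_k = 1

% Multivariate normal component density
\phi(y \mid \mu_k, \Sigma_k) =
  (2\pi)^{-d/2} \, |\Sigma_k|^{-1/2}
  \exp\!\left( -\tfrac{1}{2} (y - \mu_k)^{\top} \Sigma_k^{-1} (y - \mu_k) \right)

% One common model selection score (assumed, not stated in the abstract):
% \hat{L}_M is the maximized likelihood of model M, m_M its number of
% free parameters, n the number of observations; larger is better
\mathrm{BIC}_M = 2 \log \hat{L}_M - m_M \log n
```

Under this formulation, each candidate pairing of a covariance structure with a number of components G is a separate model M, and comparing scores across candidates replaces the ad hoc choice of clustering method and cluster count.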

Document Details

Document Type
Technical Report
Publication Date
Apr 30, 2001
Accession Number
ADA458752

Entities

People

  • Adrian Raftery
  • Alejandro Murua
  • Chris Fraley
  • Ka Y. Yeung
  • Walter L. Ruzzo

Organizations

  • George Washington University

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Computational Biology
  • Computational Science
  • Data Mining
  • Data Science
  • Data Sets
  • Databases
  • Distribution Functions
  • Fungi
  • Gene Expression
  • Information Science
  • Maximum Likelihood Estimation
  • Network Science
  • Normal Distribution
  • Proteins
  • Standards
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Computer Vision
  • Statistical Inference