EnsCat: clustering of categorical data via ensembling

Abstract

Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: distance metrics are often not well defined on categorical data; running times for computations on high-dimensional data can be considerable; and the curse of dimensionality often impedes interpretation of the results. To date, the literature and software addressing clustering for categorical data have not led to a standard approach.
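
To make the distance problem concrete, here is a minimal sketch of one common dissimilarity for categorical sequences, the Hamming distance (the number of positions at which two equal-length sequences differ). The abstract does not say which dissimilarity the authors use, so this choice is an assumption made only for illustration. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(seqs):
    """Build a symmetric matrix of Hamming distances between all sequences."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming(seqs[i], seqs[j])
    return dist


# Hypothetical toy data: four short nucleotide sequences (assumed, not from the paper).
sequences = ["ACGTAC", "ACGTTC", "TCGAAC", "TCGATC"]
dist = pairwise_hamming(sequences)

for row in dist:
    print(row)
```

A matrix like this could be passed to any distance-based clustering routine. Its cost grows with both the number of sequences and their length, which illustrates the running-time concern raised above.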

Document Details

Document Type
Pub Defense Publication
Publication Date
Sep 15, 2016
Source ID
10.1186/s12859-016-1245-9

Entities

People

  • Bertrand S. Clarke
  • Jennifer Clarke
  • Saeid Amiri

Organizations

  • Defense Threat Reduction Agency
  • National Science Foundation Directorate for Mathematical & Physical Sciences

Tags

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Regression Analysis

Technology Areas

  • AI & ML