Integration of Clustering with Semantics Learning for Massive Categorical and Mixed Data

Abstract

Despite recent efforts, clustering categorical and mixed data in the context of big data remains challenging, owing to the lack of an inherently meaningful measure of similarity between categorical objects and the high computational complexity of existing clustering techniques. Specifically, most previous studies on clustering categorical data have neglected the semantic information hidden in relationships among categories when designing their clustering algorithms. The ultimate goal of this project is to develop a novel methodology for integrating clustering with semantics learning that enables us to discover and exploit the hidden semantics of data while effectively learning clusters from massive categorical/mixed data sets. Toward this goal, the project will focus on three key issues, namely (1) semantic information discovery, (2) statistical consistency and interpretability, and (3) computational efficiency and scalability. Semantic information discovery allows us to take into account not only the distributions of categories but also their mutual relationships, such as marginal/conditional dependencies, to develop data-driven similarity measures for categorical data. Statistical consistency and interpretability are important to ensure the relevance and significance of the developed methodology. Our approach to this issue will be based on feature dependencies, feature selection and sparse learning [Witten & Tibshirani, 2010] for clustering.
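
To make the idea of a data-driven similarity measure concrete, the following is a minimal sketch of one possible approach, not the project's actual method. It assumes that two categories of an attribute are semantically close when they co-occur with similar values of the other attributes, and it scores them by the average total variation distance between their conditional distributions. The function names, the toy records and the choice of distance are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        t: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for t, ctr in counts.items()
    }


def category_dissimilarity(rows, target, a, b):
    """Average total variation distance between the context
    distributions of categories a and b of attribute `target`.

    Assumes every row has the same keys and at least one attribute
    other than `target` (hypothetical helper, for illustration only).
    """
    contexts = [k for k in rows[0] if k != target]
    total = 0.0
    for ctx in contexts:
        dists = conditional_distributions(rows, target, ctx)
        pa, pb = dists.get(a, {}), dists.get(b, {})
        values = set(pa) | set(pb)
        # Total variation distance: half the L1 distance between distributions.
        total += 0.5 * sum(abs(pa.get(v, 0.0) - pb.get(v, 0.0)) for v in values)
    return total / len(contexts)


# Toy records, invented purely for illustration.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "crimson", "shape": "round", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "blue", "shape": "square", "size": "small"},
]

print(category_dissimilarity(rows, "color", "red", "crimson"))
print(category_dissimilarity(rows, "color", "red", "blue"))
```

In this toy data, "red" and "crimson" always appear with the same shape and size, so they are treated as closer to each other than either is to "blue", even though plain category matching would consider all three equally different.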

Document Details

Document Type
DoD Grant Award
Publication Date
Sep 11, 2017
Source ID
FA23861714046

Entities

People

  • Van Nam Huynh

Organizations

  • Air Force Office of Scientific Research
  • Japan Advanced Institute of Science and Technology
  • United States Air Force

Tags

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Educational Psychology
  • Neural Network Machine Learning