Mathematically Inspired Deep Representation Learning of Unlabeled Heterogeneous Data

Abstract

The deep learning literature has focused primarily on image and text data, where it has revolutionized the fields of computer vision and natural language processing, respectively. While image and text are homogeneous data with spatial or temporal regularity, most real-world data are heterogeneous and unlabeled, without such regularity. As a result, deep learning still falls short of the state-of-the-art performance that traditional machine learning reports on heterogeneous data. Unlike traditional machine learning, the mathematical and statistical underpinnings of deep learning are poorly understood and sparingly studied in the literature. The theoretical opacity of deep models is compounded by overfitting, adversarial attacks, and uncertainty in data with missing values, especially when the data are unlabeled and supervised learning is not feasible. A general remedy for these drawbacks is to investigate new representation learning methods for heterogeneous data that do not rely on human annotations or data labels. A new learning paradigm, known as self-supervised learning (SSL), has improved image representation learning without using data labels as supervisory signals, resulting in state-of-the-art image classification performance. However, SSL has yet to be explored for heterogeneous data, which widely appear in tabular or matrix formats. The majority of real-world data collected are multivariate and stored in structured tables, as reflected in the widespread use of relational databases. The central hypothesis of this proposal is that SSL, guided by mathematical and statistical principles, can optimize and explain the performance of deep models with unlabeled heterogeneous data. The objectives for mathematically inspired SSL (MI-SSL) of unlabeled heterogeneous data are as follows.
1) Integrate complementary strengths of linear algebra, statistical learning, and deep learning for estimating missing values and uncertainty in heterogeneous data;
2) Investigate explainable methods to learn adaptive deep architectures for optimizing heterogeneous data representation;
3) Improve low-dimensional representation of unlabeled heterogeneous data via probabilistic deep cluster embedding;
4) Combine deep models with superior traditional machine learning in joint SSL frameworks to reap their complementary benefits.

The project is expected to deliver a set of new and robust learning algorithms and architectures for heterogeneous data, disseminated via conference presentations and journal publications. The outcomes of the project will benefit many scientists and practitioners seeking optimal representation learning of unlabeled data with missing values as they appear in the real world. Data-driven knowledge discovery skills are in demand almost everywhere in national defense, medicine, and industry. The project is highly relevant to the research interests of the Department of Defense, which seeks "scientific advancements in informatics, computation, and learning that support processing and making sense of data". The proposal is also poised to strengthen the existing data science graduate programs and data science research capacity of Tennessee State University.

Document Details

Document Type
DoD Grant Award
Publication Date
May 24, 2023
Source ID
W911NF2310170

Entities

People

  • Manar Samad

Organizations

  • Army Contracting Command
  • Office of the Secretary of Defense
  • Tennessee State University

Tags

Fields of Study

  • Computer science

Readers

  • Agricultural Chemistry/Soil Science
  • Artificial Intelligence
  • Distributed Systems and Data Platform Development

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks