Graphical modeling of high-dimensional tabular data

Abstract

It is standard practice to collect data on many different categorical features, such as gender, ethnicity, religious affiliation, and participation in different organizations. It is also commonly of interest to infer latent categorical features, providing a grouping of individuals, organizations, etc. into different types. Such data are routinely collected by the DoD. Regardless of whether the categorical features of interest are latent or observed, in many settings it is critically important to infer how the different features are related. While simple measures of pairwise dependence, such as variants of correlation coefficients for categorical data,are useful, graphical models go a step further in allowing one to infer dependence networks that can even relate to causal relationships. The broad utility of graphical modeling is apparent from the rich and successful literature on Gaussian graphical models. Although there are extensions to relax the Gaussianity assumption for real-valued data vectors, the literature on graphical modeling ofcategorical data is remarkably sparse with state-of-the-art methods having poor performance even in simple low-dimensional simulation examples. This project develops a transformative paradigm for graphical modeling of categorical data. Focusing on observed categorical features, Objective 1 develops a broad new toolbox of Bayesian inference methods leveraging on innovative new classes of conjugate priors for the high-dimensional probabilities in a contingency table, generalizing broadly used Dirichlet priors to incorporategraphical structure prescribing conditional independence relationships in the data. These new classes of priors lead to highly efficient computation relative to traditional sparse log-linear models, while providing a characterization of uncertainty in inferences.The methods allow for known and unknown graphs and even allow testing of the presence of an edge in a graph. Focusing on latent categorical features underlying complex observed data, Objective 2 develops a broad class of latent ensemble methods for inferring interpretable latent structure and relationships across domains. For example, individuals may fall into one group with respect to their religious ideology and another group with respect to their socioeconomic background. It is reasonable to expect statistical dependence in cluster assignment across domains but there is currently no literature on flexibly modeling such dependence. This project develops a broad class of elegant, theoretically supported, and computationally convenient probabilistic modeling frameworks for solvingsuch problems. If successful, this project will develop a transformative toolbox of methods for analyzing relationships among (observed or latent) categorical features, leading to multiple high-impact publications in leading journals while fundamentally enhancingthe ability to effectively learn interpretable and actionable signals in the data by the DoD and broad scientific and technical communities.

Document Details

Document Type
DoD Grant Award
Publication Date
Nov 09, 2024
Source ID
N000142412626

Entities

People

  • David B. Dunson

Organizations

  • Duke University
  • Office of Naval Research
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning.
  • Regression Analysis.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms