Graphical modeling of high-dimensional tabular data
Abstract
It is standard practice to collect data on many different categorical features, such as gender, ethnicity, religious affiliation, and participation in different organizations. It is also commonly of interest to infer latent categorical features, providing a grouping of individuals, organizations, etc. into different types. Such data are routinely collected by the DoD. Regardless of whether the categorical features of interest are latent or observed, in many settings it is critically important to infer how the different features are related. While simple measures of pairwise dependence, such as variants of correlation coefficients for categorical data,are useful, graphical models go a step further in allowing one to infer dependence networks that can even relate to causal relationships. The broad utility of graphical modeling is apparent from the rich and successful literature on Gaussian graphical models. Although there are extensions to relax the Gaussianity assumption for real-valued data vectors, the literature on graphical modeling ofcategorical data is remarkably sparse with state-of-the-art methods having poor performance even in simple low-dimensional simulation examples. This project develops a transformative paradigm for graphical modeling of categorical data. Focusing on observed categorical features, Objective 1 develops a broad new toolbox of Bayesian inference methods leveraging on innovative new classes of conjugate priors for the high-dimensional probabilities in a contingency table, generalizing broadly used Dirichlet priors to incorporategraphical structure prescribing conditional independence relationships in the data. These new classes of priors lead to highly efficient computation relative to traditional sparse log-linear models, while providing a characterization of uncertainty in inferences.The methods allow for known and unknown graphs and even allow testing of the presence of an edge in a graph. Focusing on latent categorical features underlying complex observed data, Objective 2 develops a broad class of latent ensemble methods for inferring interpretable latent structure and relationships across domains. For example, individuals may fall into one group with respect to their religious ideology and another group with respect to their socioeconomic background. It is reasonable to expect statistical dependence in cluster assignment across domains but there is currently no literature on flexibly modeling such dependence. This project develops a broad class of elegant, theoretically supported, and computationally convenient probabilistic modeling frameworks for solvingsuch problems. If successful, this project will develop a transformative toolbox of methods for analyzing relationships among (observed or latent) categorical features, leading to multiple high-impact publications in leading journals while fundamentally enhancingthe ability to effectively learn interpretable and actionable signals in the data by the DoD and broad scientific and technical communities.
Document Details
- Document Type
- DoD Grant Award
- Publication Date
- Nov 09, 2024
- Source ID
- N000142412626
Entities
People
- David B. Dunson
Organizations
- Duke University
- Office of Naval Research
- United States Navy