Causal Inference with Multiple Environments
Abstract
The field of causal inference has primarily focused on methods for analyzing a single dataset. In many modern applied settings, however, we observe multiple related datasets. For example, we may have datasets from many military bases, many states, or many groups of students in schools. We may have causal inference questions about individual datasets within the collection, or a question about the datasets as a group. In statistics, methods for simultaneously analyzing multiple datasets pay dividends. The multiplicity of datasets provides an opportunity for the "sharing of statistical strength", a hallmark of hierarchical modeling and empirical Bayes.
But how can we leverage this multiplicity of datasets in causal inference? In the first research thrust, we build on the popular method of synthetic controls (SC). SC is a method for analyzing panel data, and it is itself based on the idea of multiplicity: a control outcome is modeled as a function of other outcomes, and a set of full-control outcomes is used to learn that function. First, we observe that each SC problem in fact hides many SC problems, and we will exploit this idea to develop novel empirical Bayes procedures for SC. Second, we extend these ideas to a setting where we observe multiple related panels. Further, we will benchmark SC in novel ways.
In the second research thrust, we consider multi-environment data. An environment of data is a dataset produced under an intervention, either known or unknown. Our goal is to use the variation among environments, even if hidden, to untangle a causal signal. In recent years, ideas around multi-environment learning have sparked several lines of research. These works give hints that there are principled ways to analyze multiple environments. But they have largely focused on simple regression, blur the lines between assumptions and algorithms, do not scale to large data, and do not provide a cohesive theory.
We will develop a principled probabilistic ML framework for analyzing multi-environment data. The result will be a holistic methodology for analyzing multiple related datasets, collected under unknown interventions, to infer their underlying causal mechanisms. We will extend our methods to learn a full causal graph, building on existing work that only learns the direct parents of a single outcome. And we will use our probabilistic ML formulation to provide new theoretical understanding of what is and is not estimable from multi-environment data.
The proposed research is relevant to DoD capabilities. Applied causality focuses on questions of the following form: "What will happen when I intervene in the world?" This is precisely the type of question that the DoD needs to answer to determine the efficacy of many of its activities. Indeed, while we often settle for "passive" prediction, most applications of machine learning tacitly hope to solve a causal inference problem. Furthermore, many DoD problems involve data from multiple environments: multiple groups of people, multiple sites of a training program, multiple missions. This proposal is about how to capitalize on this multiplicity to answer important questions and make better decisions. This abstract is approved for public release.
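The synthetic control idea summarized in the abstract (model a unit's control outcome as a function of other units' outcomes, and learn that function from periods in which every unit is under control) can be illustrated with a small sketch. This is not the proposal's method. It assumes a linear function fit by ordinary least squares, whereas classic SC typically constrains the weights to be nonnegative and sum to one. The function names, the toy panel, and the treatment effect of 2.0 are all hypothetical and chosen only for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("donor outcomes are collinear in the pre-period")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_sc_weights(donors, treated, t0):
    """Least-squares weights on donor outcomes, using only periods before t0.

    Assumption: unconstrained linear weights (a simplification of classic SC).
    """
    periods = range(t0)
    gram = [[sum(dj[t] * dk[t] for t in periods) for dk in donors] for dj in donors]
    rhs = [sum(dj[t] * treated[t] for t in periods) for dj in donors]
    return solve_linear_system(gram, rhs)


def estimate_effects(donors, treated, t0):
    """Observed minus synthetic (predicted control) outcome for periods >= t0."""
    w = fit_sc_weights(donors, treated, t0)
    effects = []
    for t in range(t0, len(treated)):
        synthetic = sum(wj * dj[t] for wj, dj in zip(w, donors))
        effects.append(treated[t] - synthetic)
    return w, effects


# Hypothetical toy panel: two never-treated donor units over six periods.
donor_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
donor_b = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
# The treated unit's control outcome is 0.5 * a + 0.5 * b; treatment starts
# at period 4 and adds 2.0 to the observed outcome.
treated_unit = [1.5, 1.5, 2.5, 2.5, 5.5, 5.5]

weights, effects = estimate_effects([donor_a, donor_b], treated_unit, t0=4)
print("weights:", weights)
print("estimated effects:", effects)
```

In this toy panel the pre-treatment fit is exact, so the sketch recovers the weights and the effect that were built into the data. With real data the fit is noisy, which is where the sharing of statistical strength discussed in the abstract becomes relevant.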
Document Details
- Document Type: DoD Grant Award
- Publication Date: Mar 15, 2024
- Source ID: N000142412243
Entities
People
- David M. Blei
Organizations
- Office of Naval Research
- Trustees of Columbia University in the City of New York
- United States Navy