Broader Frameworks for Distribution-Free Inference

Abstract

Approved for public release.

In the era of powerful and large-scale machine learning methods, the field of statistics can play a key role in deploying these methods safely and reliably, by providing uncertainty quantification for state-of-the-art methods applied to real-world data. Can we provide estimates of our uncertainty that are valid without relying on unrealistic assumptions for the data distribution and/or learning algorithm? Finite-sample theoretical guarantees for such questions have only been studied more recently, forming the rapidly growing field of distribution-free inference. This field has provided a theoretically rigorous framework for quantifying uncertainty in prediction intervals: providing a confidence interval around our prediction of a response Y as a function of features X. However, these results have some important limitations: they cannot address questions such as conditionally valid coverage (i.e., coverage of Y given a particular value of X, to avoid issues where, for certain types of individuals X, coverage is too low), coverage of other quantities (such as the conditional mean of Y given X), or conditional independence testing. Indeed, these more complex inference questions have been proved to be impossible to address in a completely assumption-free setting.

The central theme of this proposal is the aim of reconciling hardness results with practical goals. We seek to reframe our inference questions to determine the strongest inference guarantees that we can provide without untestable assumptions, and uncover new insights on the interplay between distributional assumptions, algorithmic properties, and inference targets.
Towards this end, the first main question of the proposal is the following:

(Q1) For various inference targets for which distribution-free inference is known to be impossible, what meaningful relaxations of these inference targets would make distribution-free inference possible?

Moreover, in certain applications, data is gathered in more structured ways, such as through stratified sampling, repeated measurements, weighted sampling, etc. Under these frameworks, existing distribution-free methodology may need to be modified to be applicable, and indeed, stronger guarantees may be possible by leveraging the sampling structure. This leads to our next question:

(Q2) In structured sampling settings, can we develop sharp methods for conditionally valid predictive inference, and for inference on regression? Moreover, for settings where the sampling design is controlled by the analyst, given a particular budget what types of sampling structures are optimal in terms of accuracy of the resulting inference?

Finally, in addition to testing questions about the data distribution itself, it is often of great interest to determine properties of the learning algorithm: whether it satisfies stability conditions, and related questions such as privacy, low risk, and other desirable properties. Existing formulations of algorithmic stability are themselves known to be impossible to test in an assumption-free framework, but this condition is needed since it enables estimation of risk, guarantees of predictive coverage, and other important downstream tasks. Therefore, the final research aim of this proposal is the following question:

(Q3) What is a property of learning algorithms that captures the notion of algorithmic stability, and can provably be validated empirically?

This will allow us to develop alternative notions of algorithmic stability that do not require untestable assumptions.

Document Details

Document Type
DoD Grant Award
Publication Date
Nov 08, 2024
Source ID
N000142412544

Entities

People

  • Rina Foygel Barber

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Chicago

Tags

Fields of Study

  • Computer science

Readers

  • Regression Analysis
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms