Robust Guarantees for Black Box Methods

Abstract

Recent years have seen extraordinary advances in statistics and machine learning for fitting models to complex high-dimensional data, allowing for accurate predictions in extremely challenging tasks. This results in powerful models and tools, but the complexity of these algorithms often means that we are effectively running a "black box", where we can observe its behavior on data but are often unable to understand its parameters or internal construction. In this proposal we aim to design methods to give meaningful inference guarantees beyond the limited settings addressed by existing statistical theory. Our goal is to be able to assess the quality and accuracy of our fitted models, without sacrificing sample size, without constraining the algorithms used to fit our models, and avoiding as much as possible any untestable assumptions on the data. Our key questions lie in two research areas: distribution-free prediction, and Model-X inference.

Distribution-free prediction aims to provide predictive intervals for an unknown response variable Y as a function of features X. The goal is to ensure valid predictive intervals (that is, intervals that are guaranteed to contain Y with the stated probability) without placing any assumptions on the distribution of the data, aside from requiring that the training and test points are drawn from the same distribution. For example, even if our models are built using linear regressions, we do not want predictive coverage to fail when the linear model is not true. The recent literature offers several powerful approaches for inference on a single model, but it is substantially more challenging to compare, or pool information across, multiple models; it is common to use cross-validation, but little is known about the accuracy of such approaches. Thus, our first key question is:

(Q1) Given an unknown data distribution and "black box" algorithms, what non-asymptotic, assumption-free guarantees can we provide for pooling or comparing models?
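As a concrete illustration of a distribution-free predictive interval for a single model, the following is a minimal sketch of split conformal prediction, one standard approach from the literature. It is not a method taken from this proposal. The helper names (`split_conformal_interval`, `fit_least_squares`, `draw`, `empirical_coverage`) and the simulated data are assumptions made for illustration. The linear model is deliberately misspecified, yet the interval should still cover about 1 - alpha of new responses on average.

```python
import math
import random


def split_conformal_interval(fit, data, x_new, alpha=0.1, seed=0):
    """Return a (lower, upper) predictive interval for the response at x_new.

    fit is any black-box routine mapping a list of (x, y) pairs to a predictor.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, calib = shuffled[:half], shuffled[half:]

    predict = fit(train)  # the fitted model is never assumed to be correct
    # Absolute residuals on the held-out calibration points
    scores = sorted(abs(y - predict(x)) for x, y in calib)

    # Finite-sample quantile index: ceil((1 - alpha) * (n + 1))
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:
        return (-math.inf, math.inf)  # too few calibration points
    q = scores[k - 1]
    center = predict(x_new)
    return (center - q, center + q)


def fit_least_squares(pairs):
    """Simple linear regression (hypothetical stand-in for a black box)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def draw(rng):
    """One (x, y) pair from a nonlinear model, so the linear fit is wrong."""
    x = rng.uniform(-2, 2)
    return x, x ** 2 + rng.gauss(0, 0.5)


def empirical_coverage(trials=300, n=200, alpha=0.1, seed=1):
    """Fraction of trials in which the interval contains a fresh response."""
    rng = random.Random(seed)
    hits = 0
    for t in range(trials):
        data = [draw(rng) for _ in range(n)]
        x_new, y_new = draw(rng)
        lo, hi = split_conformal_interval(
            fit_least_squares, data, x_new, alpha, seed=t
        )
        hits += lo <= y_new <= hi
    return hits / trials
```

The guarantee illustrated here is marginal (averaged over X) and comes at the cost of holding out half of the sample for calibration, which is the kind of sample-size sacrifice the abstract aims to avoid.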
Additional challenges arise when the response Y is a binary label. If Y is inherently noisy then predictive inference is no longer meaningful: for example, if at a particular X we can determine that the probability of a positive label is exactly 50%, then we cannot accurately predict Y even though we know its probability exactly. Our next key question is therefore:

(Q2) In a binary regression problem, how can we provide meaningful distribution-free inference on underlying probabilities, rather than on the labels themselves?

The second main research area is Model-X inference, where we aim to answer questions of conditional independence: for example, is a given disease correlated with a particular genetic variant, after controlling for the patient's other genetic information and/or factors such as age or diet? The Model-X framework addresses applications where very little data is available on the response variable Y, and so we cannot confidently place any assumptions such as a known type of model for Y's dependence on the feature X and confounders Z. Instead, it is assumed that we know the dependence of X on Z. Model-X methods work by creating fake data, constructing a "knockoff" copy of the feature of interest X, which replicates its dependence on the confounders Z but has no direct dependence on Y. Thus under the null hypothesis of conditional independence, this copy acts as a control group for X. In practice, we cannot hope for perfect knowledge of the dependence of X on Z, and so these knockoff copies are not constructed perfectly. This leads [...] control, in the setting where our "control group" copies of the feature X are good copies for most, but not all, of the data points?
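The control-group idea behind Model-X methods can be illustrated with a conditional randomization test, a closely related Model-X procedure, rather than the knockoff construction itself. The sketch below is not the proposal's method. It assumes the conditional law of X given Z is known exactly and is Gaussian, which is precisely the idealization the abstract says fails in practice. The function names, the coefficient `coef`, and the test statistic are all hypothetical choices.

```python
import random


def sample_x_given_z(z, rng, coef=1.0, noise_sd=1.0):
    """Draw from the assumed-known law X | Z ~ N(coef * z, noise_sd^2)."""
    return coef * z + rng.gauss(0, noise_sd)


def statistic(xs, ys, zs, coef=1.0):
    """Absolute covariance between Y and the part of X not explained by Z."""
    n = len(xs)
    my = sum(ys) / n
    total = sum((x - coef * z) * (y - my) for x, y, z in zip(xs, ys, zs))
    return abs(total) / n


def crt_p_value(xs, ys, zs, num_copies=200, seed=0):
    """p-value for the null hypothesis that X is independent of Y given Z."""
    rng = random.Random(seed)
    observed = statistic(xs, ys, zs)
    exceed = 0
    for _ in range(num_copies):
        # Each copy mimics X's dependence on Z but is drawn without using Y,
        # so under the null it is exchangeable with the real X.
        x_copy = [sample_x_given_z(z, rng) for z in zs]
        exceed += statistic(x_copy, ys, zs) >= observed
    return (1 + exceed) / (1 + num_copies)
```

Under these assumptions any choice of statistic gives a valid p-value, because the copies and the real feature are exchangeable under the null. If `sample_x_given_z` only approximates the true dependence of X on Z, that exchangeability can break.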

Document Details

Document Type: DoD Grant Award
Publication Date: May 08, 2020
Source ID: N000142012337

Entities

People

  • Rina Foygel Barber

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Chicago

Tags

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning
  • Regression Analysis

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks
  • Biotechnology