Interactive data analysis with statistical guarantees
Abstract
The research in this proposal concerns valid statistical inference after interactive data analysis, or data snooping. Data scientists and applied statisticians today have countless tools for exploring their data, but when it comes to reporting findings obtained this way, classical statistical methods no longer carry their usual guarantees. The failure of these methods plays at least some role in the reproducibility crisis in science. It has long been recognized in statistics that reporting the results of standard statistical procedures such as confidence intervals and $p$-values after data snooping or exploratory data analysis (EDA) generally has no theoretical justification. Breiman referred to this as the quiet scandal of statistics. On the other hand, according to Tukey: "the idea of a scientist struck, as if by lightning by a question is far from the truth." This tension between exploratory and confirmatory data analysis is at the heart of this proposal: scientists want to use data to generate hypotheses, but the scientific method requires objective evaluation of these hypotheses. Work in this proposal allows data scientists to snoop through their data and still provide statistical reports with rigorous guarantees similar to those of classical methods when the classical methods are used appropriately. The main theoretical tool used here is the conditional approach to selective inference. Recent advances in this area provide a rigorous framework for inference after data snooping or model selection. Canonical examples include inference after choosing variables using the LASSO or forward stepwise model selection. While such examples are undoubtedly important, they are somewhat limited. In particular, many of the theoretical results rely on parametric assumptions which may not be valid in practice. Extending the theoretical results to cover nonparametric statistical models is one of the primary goals of this proposal.
One appeal of the examples described above is that they are computationally tractable and can provide researchers with confidence intervals and $p$-values. Extending these parametric examples to nonparametric models and the randomized setting generally requires additional computation. The limiting distributions in these theoretical results are non-standard, so Monte Carlo sampling is generally needed to construct a reference distribution for evaluating $p$-values and constructing confidence intervals. Providing robust software to sample from these distributions is another goal of this proposal.
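As a rough illustration of the conditional approach and of a Monte Carlo reference distribution, consider a toy problem that is not taken from the proposal: several independent statistics are observed, the largest is selected after looking at the data, and the selected effect is then tested. The sketch below assumes Gaussian statistics with known unit variance, conditions on the selection event (the winner exceeds the runner-up), and builds the reference distribution by simple rejection sampling. The function names and the example values are hypothetical. In this toy case a closed form exists and is used only as a check; the nonparametric and randomized settings described above would call for more sophisticated samplers.

```python
import math
import random


def normal_sf(x):
    """Upper tail probability P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def selective_p_value_mc(z, n_draws=200_000, seed=0):
    """Monte Carlo p-value for H0: mean of the selected statistic is 0.

    Assumes (for illustration only) independent N(mu_i, 1) statistics.
    The winner is the largest statistic; under H0, and conditional on the
    other statistics, it follows a standard normal truncated to values
    above the runner-up.
    """
    winner = max(range(len(z)), key=lambda i: z[i])
    threshold = max(v for i, v in enumerate(z) if i != winner)

    rng = random.Random(seed)
    kept = 0      # null draws consistent with the selection event
    exceed = 0    # kept draws at least as extreme as the observed winner
    for _ in range(n_draws):
        draw = rng.gauss(0.0, 1.0)
        if draw > threshold:
            kept += 1
            if draw >= z[winner]:
                exceed += 1
    if kept == 0:
        raise ValueError("no draws satisfied the selection event; increase n_draws")
    return winner, threshold, exceed / kept


def selective_p_value_exact(z_winner, threshold):
    """Closed-form truncated-normal p-value, available in this toy case."""
    return normal_sf(z_winner) / normal_sf(threshold)


# Hypothetical observed statistics; the second one is selected as the largest.
z_obs = [0.3, 2.5, 1.0, -0.7]
winner, threshold, p_mc = selective_p_value_mc(z_obs)
p_exact = selective_p_value_exact(z_obs[winner], threshold)
p_naive = normal_sf(z_obs[winner])  # ignores that the winner was chosen by snooping

print(f"selected index: {winner}, runner-up value: {threshold}")
print(f"naive p-value:            {p_naive:.4f}")
print(f"selective p-value (MC):   {p_mc:.4f}")
print(f"selective p-value (exact): {p_exact:.4f}")
```

The naive $p$-value treats the selected statistic as if it had been chosen in advance, so it is smaller than the selective one. Rejection sampling is used here only because it is easy to verify; it becomes inefficient when the selection event is rare.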
Document Details
- Document Type: DoD Grant Award
- Publication Date: Oct 11, 2018
- Source ID: W911NF1710416
Entities
People
- Jonathan E. Taylor
Organizations
- Army Contracting Command
- Stanford University
- United States Army