Inference after unsupervised learning

Abstract

Double-dipping is a major scientific problem. Researchers often collect data with the goal of hypothesis generation: they hope to find something interesting. They may then wish to test the resulting hypothesis on the same data. This practice, known as "double-dipping", is deeply problematic: for a classical hypothesis test to be valid, the hypothesis must be specified before looking at the data. By violating this principle, double-dipping leads to spurious results, e.g. vastly inflated Type 1 error rates and confidence intervals that fail to attain the nominal coverage. Double-dipping affects data analyses across virtually all areas of application. It is especially problematic when the hypothesis tested is generated using an unsupervised learning approach, since in that setting, sample splitting is often not applicable. This proposal takes a two-pronged approach to developing valid methods for statistical inference after double-dipping. First, in Aim 1, we consider a selective inference approach: this involves conducting a hypothesis test conditional on the event that we decided to test this particular hypothesis. We will develop approaches for testing data-driven hypotheses associated with well-known dimension reduction techniques. Second, we will consider a recent proposal, called "data thinning", that enables a researcher to split a single random variable into two or more random variables that are independent and that belong to the same distributional family as the original random variable. While this approach has promise as a solution for double-dipping, a number of important questions remain to be answered; we will work towards answering these questions in Aim 2. Finally, Aim 3 will involve the development and release of software associated with the proposed approaches.
All papers resulting from this work will be posted on arXiv and submitted to top-tier statistical methods journals, and fully documented software will be made publicly and freely available online.
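
To make the data thinning idea concrete, the following is a minimal sketch for the Poisson case only, not the software the proposal describes. It relies on a standard fact: if X is Poisson with mean lam and X1 given X is Binomial(X, eps), then X1 and X2 = X - X1 are independent Poisson variables with means eps * lam and (1 - eps) * lam. The function name thin_poisson and the parameters eps and lam are hypothetical names chosen for illustration.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts into two parts by binomial thinning.

    Hypothetical helper for illustration. If x ~ Poisson(lam), the returned
    parts are independent, with x1 ~ Poisson(eps * lam) and
    x2 ~ Poisson((1 - eps) * lam), and x1 + x2 == x.
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)  # keep each count unit with probability eps
    x2 = x - x1                # the remainder forms the second part
    return x1, x2


rng = np.random.default_rng(0)
lam, eps = 10.0, 0.5  # assumed values, chosen only for this demonstration

x = rng.poisson(lam, size=200_000)
x1, x2 = thin_poisson(x, eps, rng)

# Empirical checks: the means should be close to eps * lam and
# (1 - eps) * lam, and the sample correlation should be close to zero.
print("mean of x1:", x1.mean())
print("mean of x2:", x2.mean())
print("correlation:", np.corrcoef(x1, x2)[0, 1])
```

In a workflow of this kind, one part (say x1) could be used to generate a hypothesis, for example by clustering, and the other part (x2) to test it, since the two parts are independent even though they come from a single observation.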

Document Details

Document Type
DoD Grant Award
Publication Date
May 15, 2023
Source ID
N000142312589

Entities

People

  • Daniela Witten

Organizations

  • Office of Naval Research
  • United States Navy
  • University of Washington

Tags

Readers

  • Distributed Systems and Data Platform Development
  • Regression Analysis
  • Theoretical Analysis

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks