Inference after unsupervised learning

Abstract

Double-dipping is a major scientific problem. Researchers often collect data with the goal of hypothesis generation: they hope to find something interesting. They may then wish to test the resulting hypothesis on the same data. This practice, known as double-dipping, is deeply problematic: for a classical hypothesis test to be valid, the hypothesis must be specified before looking at the data. By violating this principle, double-dipping leads to spurious results, such as vastly inflated Type I error rates and confidence intervals that fail to attain the nominal coverage. Double-dipping affects data analyses across virtually all areas of application. It is especially problematic when the hypothesis tested is generated using an unsupervised learning approach, since in that setting, sample splitting is often not applicable. This proposal takes a two-pronged approach to developing valid methods for statistical inference after double-dipping. First, in Aim 1, we consider a selective inference approach: this involves conducting a hypothesis test conditional on the event that we decided to test this particular hypothesis. We will develop approaches for testing data-driven hypotheses associated with well-known dimension reduction techniques. Second, we will consider a recent proposal, called data thinning, that enables a researcher to split a single random variable into two or more random variables that are independent and that belong to the same distributional family as the original random variable. While this approach has promise as a solution for double-dipping, a number of important questions remain to be answered; we will work towards answering these questions in Aim 2. Finally, Aim 3 will involve the development and release of software associated with the proposed approaches.
All papers resulting from this work will be posted on arXiv and submitted to top-tier statistical methods journals, and fully documented software will be made publicly and freely available online.
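
To make the idea of data thinning concrete, the following is a minimal sketch for the Poisson case only. It relies on a standard fact: if X ~ Poisson(lambda) and X1 given X is Binomial(X, eps), then X1 and X2 = X - X1 are independent, with X1 ~ Poisson(eps * lambda) and X2 ~ Poisson((1 - eps) * lambda). The function names (sample_poisson, thin_poisson, correlation) and the values of lam, eps, and n are illustrative assumptions; this is not the software described in Aim 3, and it does not cover other distributional families.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def thin_poisson(x, eps, rng):
    """Split a Poisson count x into two parts by binomial thinning.

    Each of the x unit counts is assigned to the first part with
    probability eps, so x1 given x is Binomial(x, eps) and x2 = x - x1.
    """
    x1 = sum(1 for _ in range(x) if rng.random() < eps)
    return x1, x - x1


def correlation(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Illustrative settings (assumptions, not values from the proposal).
rng = random.Random(0)
lam, eps, n = 5.0, 0.5, 20000

xs = [sample_poisson(lam, rng) for _ in range(n)]
pairs = [thin_poisson(x, eps, rng) for x in xs]
train = [p[0] for p in pairs]  # would be used to generate a hypothesis
test = [p[1] for p in pairs]   # would be used to test that hypothesis

mean_train = sum(train) / n
mean_test = sum(test) / n
corr = correlation(train, test)

print("mean of first part: ", mean_train)
print("mean of second part:", mean_test)
print("sample correlation: ", corr)
```

In this sketch the two parts should have means near eps * lam and (1 - eps) * lam and a sample correlation near zero, which is what allows one part to play the role of a training set and the other the role of a test set even when the observations themselves cannot be split.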

Document Details

Document Type: DoD Grant Award
Publication Date: May 15, 2023
Source ID: N000142312589

Entities

People

Daniela Witten

Organizations

Office of Naval Research
United States Navy
University of Washington
