Panning for Gold: ‘Model-X’ Knockoffs for High Dimensional Controlled Variable Selection

Abstract

Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n⩾p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn's disease in the UK, making twice as many discoveries as the original analysis of the same data.

Document Details

Document Type
Pub Defense Publication
Publication Date
Jan 08, 2018
Source ID
10.1111/rssb.12265

Entities

People

  • Emmanuel Candès
  • Jinchi Lv
  • Lucas Janson
  • Yingying Fan

Organizations

  • National Institutes of Health
  • National Science Foundation
  • Office of Naval Research
  • Simons Foundation
  • Stanford University
  • University of Southern California

Tags

Fields of Study

  • Computer science
  • Mathematics

Readers

  • Distributed Systems and Data Platform Development
  • Statistical inference.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms