Statistical Tools for Reproducible Selections
Abstract
Statement of Work: In many fields of science, we observe a response variable (e.g. whether or not an individual is carrying a disease) together with a large number of potential explanatory variables (e.g. variations in a single nucleotide at hundreds of thousands of locations on the genome), and would like to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR) is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. Short of guarantees of this kind, the error rate among published claims may be very high, leading even to current speculation that "most published research findings are false", as reflected in the enormous attention the replicability crisis is receiving in both the scientific community and media outlets. This project concerns the development of novel variable selection procedures controlling the FDR in a variety of statistical models, including all generalized linear models. Our focus is on methods that achieve exact FDR control in finite-sample settings, whatever the values of the unknown parameters being tested (e.g. the regression parameters).

Objective: The objectives are threefold:
* Theory and methods. Development of statistical procedures controlling a variety of type-I errors, as well as of novel mathematics rigorously establishing their correctness. Testing of these procedures on simulated data to assess their empirical performance.
* Applications. Testing of our methods on real data sets, mostly from the life sciences. Examples include genome-wide association study (GWAS) data sets and new gene expression data measured at the cell level.
* Software tools. Development of professional-quality software in MATLAB and R, allowing researchers to reproduce our findings and apply our methods to a host of scientific problems of contemporary interest.

Approach: To achieve the goals of the proposal, we propose extending recent work on knockoff filters, an innovative methodology that is currently limited to linear regression models with essentially fewer parameters than observations. Of special concern is the possibility offered by random designs, where the challenge is to find methods for constructing "fake" variables, which can be interchanged with "null" variables having no effect on the response of interest. In this project, these fake variables shall serve as controls: they will be used to assess the statistical significance of individual findings, thereby calibrating the selection procedure and ensuring a form of reproducibility.

Overall Merit and ONR Mission/Relevance: We are in the midst of a global scientific crisis, in which science does not always seem to self-correct. The tools from this proposal will help sharpen the ability of science to be self-correcting. They will help researchers identify promising leads and reliable effects that can be reproduced, as opposed to effects that quickly vanish and cannot be confirmed by follow-up studies. (The cost of irreproducibility to society is quite high.) Our tools should be of interest to the Navy to the extent that the Navy is concerned with the cost, in wasted time and resources, of following false leads and promises. Additionally, the world is rapidly changing in that (big) data and algorithms increasingly inform important decisions and actions. Clearly, the Navy is heavily invested in this. When we have large amounts of data, it is easy to over-fit, to over-interpret, and to take a series of actions driven merely by chance that are neither robust nor warranted. Yes-or-no actions can often be modeled via logistic regression and, therefore, the tools from this proposal can avoid such pitfalls and lead to credible and reliable decisions and actions.
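
The role of fake variables as controls, described in the Approach, can be illustrated with the selection step of the existing knockoff filter for linear regression (the "knockoff+" threshold). What follows is a minimal sketch in Python, not the software proposed in this project. It assumes that a statistic W_j has already been computed for each variable, with large positive values suggesting a real effect and with signs that are symmetric for null variables. The function name `knockoff_plus_select` and the example numbers are hypothetical. Variables with negative statistics act as controls, so their count estimates the number of false discoveries among the selected variables.

```python
import numpy as np


def knockoff_plus_select(W, q):
    """Select variables with the knockoff+ threshold at target FDR level q.

    W : per-variable statistics (assumed precomputed); a large positive W[j]
        is evidence that variable j matters, a negative W[j] means its
        fake (knockoff) copy looked more important than the variable itself.
    q : target false discovery rate, e.g. 0.1 or 0.2.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes, smallest first.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics beyond -t estimate the false discoveries;
        # the extra 1 is the "+" correction of the knockoff+ rule.
        estimated_false = 1 + np.sum(W <= -t)
        n_selected = max(1, np.sum(W >= t))
        if estimated_false / n_selected <= q:
            return np.flatnonzero(W >= t)
    # No threshold meets the target level: select nothing.
    return np.array([], dtype=int)


# Hypothetical statistics for ten variables.
W_example = [5, 4, 3, 2.5, -1, 2, 1.5, -0.5, 0.2, 6]
print(knockoff_plus_select(W_example, q=0.2))  # indices 0, 1, 2, 3, 5, 6, 9
```

With the same hypothetical statistics and a stricter target of q = 0.1, no threshold satisfies the bound and the function returns an empty selection, which shows how the controls can veto discoveries when the evidence is too thin.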
Document Details
- Document Type: DoD Grant Award
- Publication Date: Aug 12, 2016
- Source ID: N000141612712
Entities
People
- Emmanuel Candès
Organizations
- Office of Naval Research
- Stanford University
- United States Navy