Sketching as a tool for statistical inference
Abstract
This proposal will address the problem of determining how to use sketching for statistical inference in modern data analysis settings. Slightly more broadly, it will solve the problem of developing a statistical foundation for sketching. Our settings will include common problems such as linear regression and principal component analysis (PCA) with large data sizes, where data analysis becomes extremely expensive. While sketching and other randomized matrix algorithms are widely used in numerical computation, at the moment it is unknown how and when to use them for the important task of statistical inference, such as forming confidence intervals and performing hypothesis tests.
We will elucidate their scope by leveraging powerful mathematical tools from asymptotic random matrix theory and free probability theory, which have only recently begun to be used in the area, in the PI's work.
The proposal is structured around three thrusts that address key sub-problems of the proposed problem. To understand the overall problem of how to use sketching for statistical inference, we will first investigate arguably the most important supervised learning problem in statistics, linear regression. Then, we will investigate one of the most important unsupervised learning problems, principal component analysis. These problems are intrinsically important and central to statistics and data science, and we expect that the insights gained here will also be valuable in more general problems, such as non-linear regression and dimension reduction. Finally, more generally, we will address the ambitious problem of building an entirely new theory showing equivalences between sketching and statistical inference.
The proposed research may vastly extend the reach and robustness of statistical inference methods through the use of randomized matrix algorithms.
Randomized matrix algorithms are at present not used for statistical inference, but our work could enable this breakthrough. More broadly, our research will start a flow of ideas between two almost disjoint areas, potentially leading to a great deal of progress.
This work is also important because it develops fundamental new algorithms and theory for modern data science. We expect to have a broad impact across many application areas, fueled by the rising significance of data science and statistics. Massive, high-dimensional, and noisy datasets are becoming ubiquitous in all areas of human activity, including artificial intelligence (AI). However, our ability to analyze them with provable guarantees is often limited, especially when the amount of computation available for analysis is small. Our research will develop critical theory and methods enabling rigorous large-scale statistical inference.
Document Details
- Document Type: DoD Grant Award
- Publication Date: Aug 20, 2021
- Source ID: N000142112843
Entities
People
- Edgar Dobriban
Organizations
- Office of Naval Research
- United States Navy
- University of Pennsylvania