Statistics meets Computation: Fundamental Trade-offs in High-Dimensional Problems

Abstract

Project Narrative: Statement of Objectives Recent advances in technology have led to the era of massive data sets, which we mean data sets that are not only larger, both in terms of sample size and dimensionality of the data, and but also more complex in terms of the types of data that can be collected. Massive and complex data sets present challenges that are both statistical and computational in nature. In the traditional approach to statistics, the issues of statistical accuracy of an estimator and the computational cost of implementing it have been considered separately. This approach is suitable to small-scale data sets, in which computation is not a limiting factor. However, large-scale data sets require an integrated approach to statistical and computational issues. Accordingly, motivated by the challenges of big data, the goal of the proposed research is to develop a principled framework for studying fundamental issues at the interface between statistics and computation. Our research proposal consists of several distinct but inter-related thrusts, as described briefly here and in more detail in the body of the proposal. For many problems, we have a range of possible statistical estimators at our disposal. They can be ordered in terms of computational complexity, ranging from simple heuristics (e.g., greedy algorithms, thresholding procedures and so on) to very complex, exponential-time heuristics (e.g., searching exhaustively over all subsets of predictors in performing variable selection). Given a range of possible procedures, we can then associate to each a pair of numbers corresponding to its computational cost and statistical efficiency. In general, one would expect (or at least hope) that a more computationally intensive method should lead to improved statistical performance (e.g., lower mean-squared error in prediction, or lower probability of error in testing). A rigorous characterization of the trade-off between computational and statistical efficiency would aid the user in allocating resources between algorithm design and data collection. A second research goal is to provide rigorous guarantees for statistical estimators that are based on solving nonconvex programs. Solving a generic nonconvex program is known to be computationally intractable in the general setting, but statistical settings lead to random ensembles of problems (as opposed to adversarially constructed instances). Moreover, statistical methods based on nonconvex programs often perform well in practice. Thus, an important open question is to provide a theoretical characterization of when such good behavior is to be expected. Another facet of computation concerns the distinction between decentralized and centralized procedures. For massive data sets, it is often not feasible to store all the data at a single site; rather, the data set needs to be broken into separate pieces and allocated to different processors. The processors then communicate locally in order to aggregate their estimates, thereby forming a global estimate. Estimators of this type raise a natural theoretical question: how much statistical efficiency is lost in the decentralized setting? The intended research is theoretical and will not result in any environmental impacts.

Document Details

Document Type: DoD Grant Award
Publication Date: Aug 12, 2016
Source ID: N000141512354

Entities

People

Martin J. Wainwright

Organizations

Office of Naval Research
United States Navy
University of California Regents

Statistics meets Computation: Fundamental Trade-offs in High-Dimensional Problems

Abstract

Document Details

Entities

People

Organizations

Tags

Fields of Study

Readers