Statistics meets Computation: Fundamental Trade-offs in High-Dimensional Problems
Abstract
Project Narrative: Statement of Objectives Recent advances in technology have led to the era of massive data sets, which we mean data sets that are not only larger, both in terms of sample size and dimensionality of the data, and but also more complex in terms of the types of data that can be collected. Massive and complex data sets present challenges that are both statistical and computational in nature. In the traditional approach to statistics, the issues of statistical accuracy of an estimator and the computational cost of implementing it have been considered separately. This approach is suitable to small-scale data sets, in which computation is not a limiting factor. However, large-scale data sets require an integrated approach to statistical and computational issues. Accordingly, motivated by the challenges of big data, the goal of the proposed research is to develop a principled framework for studying fundamental issues at the interface between statistics and computation. Our research proposal consists of several distinct but inter-related thrusts, as described briefly here and in more detail in the body of the proposal. For many problems, we have a range of possible statistical estimators at our disposal. They can be ordered in terms of computational complexity, ranging from simple heuristics (e.g., greedy algorithms, thresholding procedures and so on) to very complex, exponential-time heuristics (e.g., searching exhaustively over all subsets of predictors in performing variable selection). Given a range of possible procedures, we can then associate to each a pair of numbers corresponding to its computational cost and statistical efficiency. In general, one would expect (or at least hope) that a more computationally intensive method should lead to improved statistical performance (e.g., lower mean-squared error in prediction, or lower probability of error in testing). A rigorous characterization of the trade-off between computational and statistical efficiency would aid the user in allocating resources between algorithm design and data collection. A second research goal is to provide rigorous guarantees for statistical estimators that are based on solving nonconvex programs. Solving a generic nonconvex program is known to be computationally intractable in the general setting, but statistical settings lead to random ensembles of problems (as opposed to adversarially constructed instances). Moreover, statistical methods based on nonconvex programs often perform well in practice. Thus, an important open question is to provide a theoretical characterization of when such good behavior is to be expected. Another facet of computation concerns the distinction between decentralized and centralized procedures. For massive data sets, it is often not feasible to store all the data at a single site; rather, the data set needs to be broken into separate pieces and allocated to different processors. The processors then communicate locally in order to aggregate their estimates, thereby forming a global estimate. Estimators of this type raise a natural theoretical question: how much statistical efficiency is lost in the decentralized setting? The intended research is theoretical and will not result in any environmental impacts.
Document Details
- Document Type
- DoD Grant Award
- Publication Date
- Aug 12, 2016
- Source ID
- N000141512354
Entities
People
- Martin J. Wainwright
Organizations
- Office of Naval Research
- United States Navy
- University of California Regents