A Scalable Bootstrap for Massive Data

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets—which are increasingly prevalent—the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification of tuning parameters (such as the number of subsampled data points), and they often require knowledge of the estimator's convergence rate, in contrast with the bootstrap. As an alternative, we introduce the ‘bag of little bootstraps’ (BLB), which is a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the m out of n bootstrap and subsampling. In addition, we present results from a large-scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.

Document Details

Document Type
Pub Defense Publication
Publication Date
Mar 17, 2014
Source ID
10.1111/rssb.12050

Entities

People

  • Ameet Talwalkar
  • Ariel Kleiner
  • Michael I. Jordan
  • Purnamrita Sarkar

Organizations

  • Army Research Office
  • National Science Foundation
  • United States Army Research Laboratory
  • University of California, Berkeley

Tags

Fields of Study

  • Mathematics

Readers

  • Adaptive Control and Estimation with Uncertainty in Dynamic Systems.
  • Regression Analysis.
  • Systems Analysis and Design