Distributed Statistical Machine Learning via Concurrency Control
Abstract
One of the grand challenges in modern computing is the design of data-analysis systems that scale to extremely large collections of data while simultaneously providing control over statistical error rates and over computational resources such as runtime. Achieving bounds that are not merely correct but also useful in practice, particularly for problems at extreme scales, requires exploring parallel and distributed computing architectures. We tackle this challenge by taking concurrency-control ideas from the database community as a point of departure and adapting the concurrency-control paradigm to the needs of large-scale statistical inference. We focus on problems involving clustering and other combinatorial tasks, given the heterogeneity present in large-scale data sets and the combinatorial nature of distributed computing architectures. We propose two main threads of research: the first involving the highly scalable paradigm of correlation clustering, and the second involving hierarchical Bayesian nonparametric models. Most existing work in these areas involves sequential algorithms that run on a single machine. We will develop parallel and distributed computational models for these tasks, implement them in a distributed computing environment, and analyze the models both theoretically and empirically.
Document Details
- Document Type: DoD Grant Award
- Publication Date: Aug 12, 2016
- Source ID: N000141512670
Entities
People
- Michael I. Jordan
Organizations
- Office of Naval Research
- United States Navy
- University of California Regents