Distributed Statistical Machine Learning via Concurrency Control

Abstract

Project Summary: One of the grand challenges in modern computing is the design of data-analysis systems that scale to extremely large collections of data while simultaneously providing control over statistical error rates and over computational resources such as runtime. Achieving bounds that are not merely correct but also useful in practice, particularly for problems at extreme scales, requires exploring parallel and distributed computing architectures. We tackle this challenge by taking concurrency-control ideas from the database community as a point of departure and adapting the concurrency-control paradigm to the needs of large-scale statistical inference. We focus on problems involving clustering and other combinatorial tasks, given the heterogeneity present in large-scale data sets and the combinatorial nature of distributed computing architectures. We propose two main threads of research: the first involves the highly scalable paradigm of correlation clustering, and the second involves hierarchical Bayesian nonparametric models. Most existing work in these areas involves sequential algorithms that run on a single machine. We will develop parallel and distributed computational models for these tasks, implement them in a distributed computing environment, and analyze the models both theoretically and empirically.

Document Details

Document Type
DoD Grant Award
Publication Date
Aug 12, 2016
Source ID
N000141512670

Entities

People

  • Michael I. Jordan

Organizations

  • Office of Naval Research
  • United States Navy
  • University of California Regents

Tags

Fields of Study

  • Computer science

Readers

  • Computational Fluid Dynamics (CFD)
  • Parallel and Distributed Computing
  • Statistical inference

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms