Framework Design for Improving Computational Efficiency and Programming Productivity for Distributed Machine Learning

Abstract

Machine learning (ML) methods are used to analyze data in a wide range of areas, such as finance, e-commerce, medicine, science, and engineering, and in the era of big data the size of ML problems has grown rapidly in terms of both data size and model size. This trend drives industry and academic communities toward distributed machine learning, which scales out ML training across a distributed system so that it completes in a reasonable amount of time. There are two challenges in implementing distributed machine learning: computational efficiency and programming productivity. The traditional data-parallel approach often leads to suboptimal training performance in distributed ML because of data dependencies among model parameter updates and nonuniform convergence rates of model parameters. From the perspective of an ML programmer, distributed ML programming requires substantial development overhead even with high-level frameworks, because they require the programmer to switch from a familiar sequential programming model to a different mental model. The goal of my thesis is to improve the computational efficiency and programming productivity of distributed machine learning. In an efficiency study, I explore model update scheduling schemes that consider data dependencies and nonuniform convergence speeds of model parameters to maximize convergence per iteration, and I present a runtime system, STRADS, that efficiently executes ML applications with scheduled model updates in a distributed system. In a productivity study, I present a familiar, sequential-like programming API that simplifies conversion of a sequential ML program into a distributed program without requiring an ML programmer to switch to a different mental model, and I implement a new runtime system, STRADS-Automatic Parallelization (AP), that efficiently executes ML applications written in our API in a distributed system.

Document Details

Document Type: Technical Report
Publication Date: Dec 01, 2018
Accession Number: AD1173977

Entities

People

  • Jin K. Kim

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Algorithms
  • Alzheimer Disease
  • Application Software
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Big Data
  • Cloud Computing
  • Computational Science
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Analysis
  • Data Mining
  • Data Processing
  • Information Processing
  • Information Science
  • Information Systems
  • Machine Learning
  • Natural Language Processing
  • Network Science
  • Neural Networks
  • Operating Systems
  • Supervised Machine Learning

Fields of Study

  • Computer science
  • Engineering

Readers

  • Aviation Safety and Air Traffic Management
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks