Asynchronous Distributed Estimation of Topic Models for Document Analysis

Abstract

Given the prevalence of large data sets and the availability of inexpensive parallel computing hardware, there is significant motivation to explore distributed implementations of statistical learning algorithms. In this paper, we present a distributed learning framework for Latent Dirichlet Allocation (LDA), a well-known Bayesian latent variable model for sparse matrices of count data. In the proposed approach, data are distributed across P processors, and processors independently perform inference on their local data and communicate their sufficient statistics in a local, asynchronous manner with other processors. We apply two different approximate inference techniques for LDA, collapsed Gibbs sampling and collapsed variational inference, within a distributed framework. The results show significant improvements in computation time and memory usage when running the algorithms on very large text corpora using parallel hardware. Despite the approximate nature of the proposed approach, simulations suggest that asynchronous distributed algorithms are able to learn models that are nearly as accurate as those learned by the standard non-distributed approaches. We also find that our distributed algorithms converge rapidly to good solutions.
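
The scheme described in the abstract can be illustrated with a minimal single-process sketch. It is not the report's implementation: the P processors are simulated one after another rather than run in parallel, only the collapsed Gibbs sampling variant is shown, and the names (`run_async_lda`, `local_gibbs_sweep`, `gossip`), the round-robin document partition, the random choice of peer, and the hyperparameter values are all assumptions made for illustration. Each simulated processor samples topics for its own tokens using its local word-topic counts plus the most recent counts it has cached from peers, then swaps local counts with one other processor.

```python
import numpy as np

def local_gibbs_sweep(proc, alpha, beta, rng):
    """One collapsed Gibbs sweep over a processor's own documents."""
    local = proc["local"]                       # word-topic counts of this processor's tokens
    view = local + sum(proc["cache"].values())  # own counts plus last-seen peer counts (a copy)
    topic_totals = view.sum(axis=0)
    W = local.shape[0]
    for d, doc in enumerate(proc["docs"]):
        for i, w in enumerate(doc):
            k = proc["z"][d][i]
            # Remove the token's current assignment from all counts.
            proc["n_dk"][d, k] -= 1
            local[w, k] -= 1
            view[w, k] -= 1
            topic_totals[k] -= 1
            # Standard collapsed Gibbs conditional for LDA.
            p = (proc["n_dk"][d] + alpha) * (view[w] + beta) / (topic_totals + W * beta)
            k = rng.choice(len(p), p=p / p.sum())
            # Record the new assignment.
            proc["z"][d][i] = k
            proc["n_dk"][d, k] += 1
            local[w, k] += 1
            view[w, k] += 1
            topic_totals[k] += 1

def gossip(procs, a, b):
    """Processors a and b exchange their local sufficient statistics."""
    procs[a]["cache"][b] = procs[b]["local"].copy()
    procs[b]["cache"][a] = procs[a]["local"].copy()

def run_async_lda(docs, W, K, P, sweeps, alpha=0.1, beta=0.01, seed=0):
    """Simulate P processors; docs is a list of lists of word ids in [0, W)."""
    rng = np.random.default_rng(seed)
    procs = []
    for p in range(P):
        part = docs[p::P]                       # hypothetical round-robin partition
        z = [rng.integers(K, size=len(doc)) for doc in part]
        n_dk = np.zeros((len(part), K), dtype=np.int64)
        local = np.zeros((W, K), dtype=np.int64)
        for d, doc in enumerate(part):
            for w, k in zip(doc, z[d]):
                n_dk[d, k] += 1
                local[w, k] += 1
        procs.append({"docs": part, "z": z, "n_dk": n_dk, "local": local, "cache": {}})
    for _ in range(sweeps):
        for a in range(P):
            local_gibbs_sweep(procs[a], alpha, beta, rng)
            if P > 1:
                # Pick one random peer and swap counts (no global synchronization).
                b = int(rng.choice([q for q in range(P) if q != a]))
                gossip(procs, a, b)
    return procs

if __name__ == "__main__":
    toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 2, 1], [4, 3, 4, 4], [1, 0], [3, 3, 4]]
    result = run_async_lda(toy_docs, W=5, K=2, P=3, sweeps=20)
    # Summing the local counts over processors gives the global word-topic counts.
    print(sum(p["local"] for p in result))
```

Because a processor only ever sees peer counts as of its last exchange, its view of the global counts is stale and incomplete, which is the source of the approximation the abstract mentions.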

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 2010
Accession Number
ADA539888

Entities

People

  • Arthur Asuncion
  • Max Welling
  • Padhraic Smyth

Organizations

  • University of California, Irvine

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Algorithms
  • Bayesian Networks
  • Computational Science
  • Computations
  • Computer Networks
  • Data Science
  • Data Sets
  • Distance Learning
  • Information Processing
  • Information Retrieval
  • Information Science
  • Machine Learning
  • Monte Carlo Method
  • Network Science
  • Probability
  • Sampling
  • Statistical Analysis

Fields of Study

  • Computer science

Readers

  • Parallel and Distributed Computing
  • Statistical Inference
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms