Scalable Inference of Discrete Data: User Behavior, Networks and Genetic Variation

Abstract

Recent years have seen explosive growth in data, models and computation. Massive data sets and sophisticated probabilistic models are increasingly used in high-energy physics, biology, genetics and personalization applications; however, many statistical algorithms remain inefficient, impeding scientific progress. In this thesis, we present several efficient statistical algorithms for learning from massive discrete data sets. We focus on discrete data because complex and structured activity such as chromosome folding in three dimensions, human genetic variation, social network interactions and product ratings is often encoded as simple matrices of discrete numerical observations. Our algorithms derive from a Bayesian perspective and lie in the framework of directed graphical models and mean-field variational inference. Situated in this framework, we gain computational and statistical efficiency through modeling insights and through subsampling informative data during inference. We begin with additive Poisson factorization models for recommending items to users based on user consumption or ratings. These models provide sparse latent representations of users and items and capture the long-tailed distributions of user consumption. We use them as building blocks for article recommendation models by sharing latent spaces across readership and article text. We demonstrate that our algorithms scale to massive data sets, are easy to implement and provide competitive user recommendations. Then, we develop a Bayesian nonparametric model in which the latent representations of users and items grow to accommodate new data. In the second part of the thesis, we develop novel algorithms for discovering overlapping communities in large networks.
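
As a rough illustration of the additive Poisson factorization idea mentioned in the abstract, the sketch below simulates a small user-by-item count matrix. It assumes the commonly used form in which each user and each item has a non-negative latent vector drawn from a Gamma prior, and each observed count is Poisson with rate equal to the inner product of the two vectors. The abstract does not state the priors or any hyperparameters, so the function names, the `shape` and `scale` values and the sampler are illustrative assumptions, not the thesis's implementation. The sketch covers only the generative side and does not perform the variational inference the thesis develops.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's method (suitable for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_ratings(n_users, n_items, n_factors, shape=0.3, scale=1.0, seed=0):
    """Simulate counts y[u][i] ~ Poisson(sum_k theta[u][k] * beta[i][k]).

    Assumption: Gamma(shape, scale) priors on every latent weight. A small
    shape pushes most weights toward zero, which gives sparse latent vectors.
    """
    rng = random.Random(seed)
    # Hypothetical names: theta holds user weights, beta holds item weights.
    theta = [[rng.gammavariate(shape, scale) for _ in range(n_factors)]
             for _ in range(n_users)]
    beta = [[rng.gammavariate(shape, scale) for _ in range(n_factors)]
            for _ in range(n_items)]
    ratings = []
    for u in range(n_users):
        row = []
        for i in range(n_items):
            # Additive rate: each latent factor contributes a non-negative term.
            rate = sum(t * b for t, b in zip(theta[u], beta[i]))
            row.append(sample_poisson(rate, rng))
        ratings.append(row)
    return theta, beta, ratings


if __name__ == "__main__":
    _, _, ratings = simulate_ratings(n_users=5, n_items=8, n_factors=3)
    for row in ratings:
        print(row)
```

Running the script prints one row of simulated counts per user. Because the rate is a sum of non-negative terms, a zero count needs no special handling under this assumed form, which suits matrices in which most entries are zero.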

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2015
Accession Number
ADA623497

Entities

People

  • Prem K. Gopalan

Organizations

  • Princeton University

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Bayesian Networks
  • Computational Science
  • Data Mining
  • Databases
  • Genetic Variation
  • Genetics
  • Information Processing
  • Information Retrieval
  • Information Science
  • Knowledge Management
  • Machine Learning
  • Monte Carlo Method
  • Network Science
  • Probabilistic Models
  • Social Media
  • Statistical Algorithms

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • Biotechnology
  • Space