Divide-and-Conquer Bayesian Inference for Massive Space-time Data

Abstract

With the advancement of spatial referencing technologies such as Geographical Information Systems (GIS) and Global Positioning Systems (GPS) that can identify geographical coordinates, researchers in various disciplines have gathered an unprecedented variety of geocoded data. Consequently, identifying spatial associations with sophisticated statistical models has become an enormously active area of research. Arguably, such data are best analyzed by hierarchical models that capture variation at multiple levels. Massive spatio-temporal data, both in terms of the number of locations and the number of observations per location, provide scientists with an unprecedented opportunity to hypothesize and test complex relationships. This, in turn, leads to the implementation of rather complex hierarchical models that are computationally expensive even for moderately sized datasets. Such models become computationally infeasible when the number of spatial locations becomes massive (say, 1-100 million). This team recognizes the increased computational demands of statistical modeling for large multivariate spatio-temporal data and proposes a Bayesian framework to tackle a wide variety of large geostatistical and point process datasets. The framework also takes advantage of embarrassingly parallel computation for large databases.

Intellectual merit: This proposal sketches a comprehensive framework for carrying out hierarchical Bayesian statistical inference on massive spatio-temporal data. The focus of the proposal is methodological rather than purely theoretical or purely applied. Thus, innovative statistical methods and theory are developed to propose the Aggregated Monte Carlo (AMC) posterior as a theoretically justifiable approximation to the full data posterior. The AMC posterior is obtained as the Wasserstein mean of posterior distributions computed on smaller subsets of the data, referred to as subset posteriors, and provides scalable inference on massive databases.
Novel theoretical results that will enhance the development of AMC for rich and scalable spatio-temporal models will be explored, though AMC will always be geared toward helping practitioners. The long-term goal of the PI is to develop a broad class of spatio-temporal models that can answer key questions in a wide variety of experiments involving massive and complex datasets in ocean science, public health, and other fields. An important aspect of the proposed framework that makes it stand out is that it is model-free, in the sense that it can improve the scalability of any spatio-temporal model severalfold. In fact, to the best of our knowledge, the computation time of the proposed method relative to both its inferential and predictive performance is exceptional.

Broader impact: As the scientific community moves into a data-driven era, there is an unprecedented opportunity to build understanding of how climate indicators change over time and space and how they will respond to changing environmental conditions. Additionally, large-scale disease occurrence data contain a wealth of small-scale spatial variation and temporal trends that need to be explored to support better public policy decisions. Although development of the proposed modeling framework is motivated by such substantive questions in climatology, oceanography, and public health, potential advancements in data modeling will be extremely useful in fields such as geoscience and engineering, where the fundamental goal is to use new findings to help improve society. The most appealing feature of the proposed framework is rapid computation that exploits parallel computer architecture. Further, the proposed development of open-source software and associated learning material will make these methodological and computational advances accessible to researchers in applied fields.
The proposed development can potentially revolutionize practices with big data and greatly benefit the Army by enabling quick and precise decision making in data-rich environments.
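
As a rough illustration of the divide-and-conquer idea described in the abstract, the following Python sketch combines subset posteriors for a single scalar parameter. It is not the proposal's implementation. It assumes a toy model (normal mean with known variance and a flat prior), raises each subset likelihood to the power of the number of subsets so that subset posteriors have full-data scale (an assumption borrowed from the divide-and-conquer literature, not stated in the abstract), and uses the fact that in one dimension the 2-Wasserstein mean of several distributions has a quantile function equal to the average of their quantile functions. All function and variable names are hypothetical.

```python
import numpy as np


def subset_posterior_draws(y_subset, n_subsets, sigma2, n_draws, rng):
    """Draw from a toy subset posterior for a normal mean (known variance, flat prior).

    Assumption: the subset likelihood is raised to the power n_subsets, so the
    posterior variance is sigma2 / (m * n_subsets) rather than sigma2 / m.
    """
    m = y_subset.size
    post_mean = y_subset.mean()
    post_var = sigma2 / (m * n_subsets)
    return rng.normal(post_mean, np.sqrt(post_var), size=n_draws)


def wasserstein_mean_1d(draws_by_subset, n_quantiles=999):
    """Average the empirical quantiles of the subset posteriors.

    For a scalar parameter this gives the quantile function of the
    2-Wasserstein mean (barycenter) with equal weights.
    """
    probs = np.linspace(0.0, 1.0, n_quantiles + 2)[1:-1]  # drop 0 and 1
    quantiles = np.stack([np.quantile(d, probs) for d in draws_by_subset])
    return probs, quantiles.mean(axis=0)


rng = np.random.default_rng(0)
n, k, sigma2 = 100_000, 10, 4.0            # data size, number of subsets, known variance
y = rng.normal(1.5, np.sqrt(sigma2), size=n)

# Step 1: split the data at random into k subsets.
subsets = np.array_split(rng.permutation(y), k)

# Step 2: sample each subset posterior. The subsets do not communicate, so this
# loop could run on k separate workers; it is kept serial here for simplicity.
draws = [subset_posterior_draws(s, k, sigma2, 5000, rng) for s in subsets]

# Step 3: combine the subset posteriors through their averaged quantiles.
probs, amc_quantiles = wasserstein_mean_1d(draws)
amc_median = amc_quantiles[len(amc_quantiles) // 2]  # quantile at probability 0.5
```

The quantile-averaging shortcut applies only to a scalar parameter. For multivariate parameters, computing a Wasserstein mean generally requires solving an optimization problem, which this sketch does not attempt.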

Document Details

Document Type
DoD Grant Award
Publication Date
Jul 27, 2018
Source ID
N000141812741

Entities

People

  • Rajarshi Guhaniyogi

Organizations

  • Office of Naval Research
  • United States Navy
  • University of California, Santa Cruz

Tags

Readers

  • Distributed Systems and Data Platform Development
  • Regression Analysis
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • Space