Divide and Recombine for Large Complex Data

Abstract

The Divide and Recombine (D and R) statistical approach was developed for analyzing big data when the computational complexity is very high. The analyst divides the data into subsets using a D and R division method and applies analytic methods to each subset independently, with no communication between subsets. The outputs of each analytic method are then recombined by a D and R recombination method. Because the subsets are processed independently, the computation can be extensively parallelized. DeltaRho is the open-source software implementation of D and R (see www.deltarho.org). The front end is the R package datadr, a language that makes programming D and R simple. The back end, running on a cluster, is a distributed database and parallel compute engine such as Hadoop, which spreads subsets and outputs across the cluster and executes the analyst's R and datadr code in parallel. The R package RHIPE provides communication between datadr and Hadoop. DeltaRho thus protects the analyst from having to manage the database and the parallel computation. This research was performed under the XDATA program, which aims to meet big data challenges by developing computational techniques and software tools for processing and analyzing vast amounts of mission-oriented information.
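
The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal Python sketch. It is not the datadr or RHIPE API, and it makes several assumptions for illustration only: the function names (divide, analyze, recombine, divide_and_recombine) are hypothetical, the division is a simple round-robin split, the analytic method is a per-subset mean, the recombination is a size-weighted average, and a local thread pool stands in for a cluster back end such as Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def divide(records, n_subsets):
    """Division step (assumed): deal records into n_subsets round-robin."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def analyze(subset):
    """Analytic method applied to one subset alone: its size and its mean."""
    return len(subset), fmean(subset)


def recombine(outputs):
    """Recombination step (assumed): size-weighted average of subset means."""
    total = sum(n for n, _ in outputs)
    return sum(n * m for n, m in outputs) / total


def divide_and_recombine(records, n_subsets=4):
    subsets = divide(records, n_subsets)
    # No subset needs data from another, so the per-subset work can be
    # mapped out independently; the thread pool is only a local stand-in
    # for a distributed compute engine.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(analyze, subsets))
    return recombine(outputs)


# Illustrative data: the numbers 1..100 as floats.
data = [float(x) for x in range(1, 101)]
estimate = divide_and_recombine(data)
```

For a mean, the size-weighted recombination reproduces the all-data result up to floating-point rounding. For most analytic methods the recombined output is only an approximation of the all-data result, which is why the choice of division and recombination methods matters. The sketch also assumes that every subset is non-empty.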

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 2017
Accession Number
AD1043385

Entities

People

  • Jeffrey Heer
  • Patrick Hanrahan
  • Ryan Hafen
  • William Cleveland

Tags

Communities of Interest

  • Autonomy
  • Biomedical
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Air Force
  • Air Force Research Laboratories
  • Big Data
  • Computational Complexity
  • Computer Graphics
  • Computer Programming
  • Computers
  • Data Analysis
  • Data Mining
  • Data Science
  • Data Visualization
  • Databases
  • Domain Specific Programming Languages
  • Information Science
  • Language
  • Parallel Computing
  • Programming Languages

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Parallel and Distributed Computing