Divide and Recombine for Large Complex Data
Abstract
The Divide and Recombine (D and R) statistical approach was developed for analyzing big data where the computational complexity is very high. The analyst divides the data into subsets by a D and R division technique and applies analytic methods to each subset independently, without communication between subsets. The outputs of each analytic method are then recombined by a D and R recombination procedure. Because the subsets are processed independently, the approach allows extensive parallel computation. DeltaRho software is the open-source implementation of D and R (see www.deltarho.org). The front end is the R package datadr, which provides a language that makes programming D and R simple. At the back end, running on a cluster, is a distributed database and parallel compute engine such as Hadoop, which spreads subsets and outputs across the cluster and executes the analyst's R and datadr code in parallel. The R package RHIPE provides communication between datadr and Hadoop. DeltaRho thus protects the analyst from having to manage the database and the parallel computation. This research was performed under the XDATA program, which aims to meet big data challenges by developing computational techniques and software tools for processing and analyzing vast amounts of mission-oriented information.
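The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use the datadr, RHIPE, or Hadoop APIs. The function names (divide, fit_line, recombine, divide_and_recombine) are hypothetical, a local thread pool stands in for the cluster, and averaging per-subset regression coefficients is only one simple recombination chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def divide(rows, n_subsets):
    """Division step (assumed scheme): deal rows into n_subsets by position."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Analytic method for one subset: least-squares fit of y = a + b * x."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)


def recombine(outputs):
    """Recombination step (assumed scheme): average per-subset coefficients."""
    intercepts, slopes = zip(*outputs)
    return (fmean(intercepts), fmean(slopes))


def divide_and_recombine(rows, n_subsets=4):
    subsets = divide(rows, n_subsets)
    # Each subset is fitted independently, with no communication between
    # workers. Threads here are only a stand-in for a cluster back end.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(fit_line, subsets))
    return recombine(outputs)


# Synthetic, noise-free data on the line y = 2 + 3x.
data = [(float(x), 2.0 + 3.0 * x) for x in range(40)]
intercept, slope = divide_and_recombine(data)
```

Because fit_line never reads anything outside its own subset, the per-subset calls can run in any order or on separate machines, which is the property the abstract relies on for parallel computation.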
Document Details
- Document Type: Technical Report
- Publication Date: Dec 01, 2017
- Accession Number: AD1043385
Entities
People
- Jeffrey Heer
- Patrick Hanrahan
- Ryan Hafen
- William Cleveland