Understanding Inefficiencies in Data-Intensive Computing

Abstract

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how such systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple model of I/O resource consumption that predicts the ideal lower-bound runtime of a parallel dataflow on a particular set of hardware. Comparing actual system performance to the model's ideal prediction exposes the inefficiency of a scale-out system. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model's ideal can be approached (i.e., that it is not wildly optimistic), but that a gap of up to 20% remains for workloads using up to 45 nodes. Guided by the model, we analyze inefficiencies exposed in both the disk and networking subsystems. These are issues that will be faced by any data-intensive scalable computing (DISC) system built atop popular commodity hardware and OSs.
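The abstract does not spell out the model itself, so the following is only a minimal sketch of the general idea, not the report's actual formulation. It assumes a simple bottleneck bound: if all I/O on a node overlaps perfectly, the runtime can be no shorter than the time the slowest resource needs to move its share of the data. The function names, the resource names, the bandwidth figures, and the definition of the gap as (actual - ideal) / ideal are all hypothetical choices made for illustration.

```python
def ideal_runtime_s(bytes_per_node, bandwidth_bytes_per_s):
    """Assumed bottleneck bound: with perfect overlap of all I/O,
    the slowest resource on a node determines the minimum runtime."""
    return max(
        bytes_per_node[resource] / bandwidth_bytes_per_s[resource]
        for resource in bytes_per_node
    )


def inefficiency(actual_s, ideal_s):
    """Fractional gap between a measured runtime and the ideal bound
    (assumed definition: relative to the ideal)."""
    return (actual_s - ideal_s) / ideal_s


# Hypothetical per-node workload: 4 GB through each resource.
per_node_bytes = {"disk_read": 4e9, "network": 4e9, "disk_write": 4e9}
# Hypothetical sustained bandwidths in bytes per second.
bandwidth = {"disk_read": 100e6, "network": 125e6, "disk_write": 80e6}

ideal = ideal_runtime_s(per_node_bytes, bandwidth)  # disk_write is the bottleneck
gap = inefficiency(60.0, ideal)                     # 60.0 s is a made-up measurement
print(f"ideal = {ideal:.1f} s, gap = {gap:.0%}")
```

With these made-up numbers the disk-write path is the bottleneck, and a measured runtime of 60 seconds would sit 20% above the bound, which is the kind of comparison the abstract describes.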

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2012
Accession Number: ADA628031

Entities

People

Elie Krevat
Eric Anderson
Gregory R. Ganger
Jay J. Wylie
Joseph Tucek
Tomer Shiran

Organizations

Carnegie Mellon University
