Understanding Inefficiencies in Data-Intensive Computing
Abstract
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how such systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple model of I/O resource consumption that predicts the ideal lower-bound runtime of a parallel dataflow on a particular set of hardware. Comparing actual system performance to the model's ideal prediction exposes the inefficiency of a scale-out system. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model's ideal can be approached (i.e., that it is not wildly optimistic), but that a gap of up to 20% remains for workloads using up to 45 nodes. Guided by the model, we analyze inefficiencies exposed in both the disk and networking subsystems, issues that will be faced by any DISC system built atop popular commodity hardware and OSs.
Document Details
- Document Type: Technical Report
- Publication Date: Jan 01, 2012
- Accession Number: ADA628031
Entities
People
- Elie Krevat
- Eric Anderson
- Gregory R. Ganger
- Jay J. Wylie
- Joseph Tucek
- Tomer Shiran
Organizations
- Carnegie Mellon University