Applying Performance Models to Understand Data-Intensive Computing Efficiency

Abstract

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the optimal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3-13x longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results. To validate the sanity of our model, we present small-scale experiments with Hadoop and a simplified dataflow processing tool called Parallel DataSeries. Parallel DataSeries achieves performance close to the analytic optimal, showing that the model is realistic and that large improvements in the efficiency of parallel analytics are possible.

Document Details

Document Type: Technical Report
Publication Date: May 01, 2010
Accession Number: ADA532848

Entities

People

Elie Krevat
Eric Anderson
Gregory R. Ganger
Jay J. Wylie
Joseph Tucek
Tomer Shiran

Organizations

Carnegie Mellon University
