Understanding Inefficiencies in Data-Intensive Computing

Abstract

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how such systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple model of I/O resource consumption that predicts the ideal lower-bound runtime of a parallel dataflow on a particular set of hardware. Comparing actual system performance to the model's ideal prediction exposes the inefficiency of a scale-out system. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model's ideal can be approached (i.e., that it is not wildly optimistic), but that a gap of up to 20% remains for workloads using up to 45 nodes. Guided by the model, we analyze inefficiencies exposed in both the disk and networking subsystems. These are issues that will be faced by any DISC system built atop popular commodity hardware and OSs.
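The abstract's idea of comparing a measured runtime against an ideal lower bound derived from I/O volumes can be illustrated with a minimal sketch. This is not the report's actual model: the phase structure, the bottleneck rule (disk reads and writes share one disk, and disk and network overlap perfectly), and every name below (Phase, Node, ideal_runtime, inefficiency) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Phase:
    """Per-node I/O volumes for one dataflow phase (hypothetical structure)."""
    disk_read_bytes: float
    network_bytes: float
    disk_write_bytes: float


@dataclass
class Node:
    """Per-node hardware bandwidths in bytes per second (hypothetical structure)."""
    disk_read_bw: float
    disk_write_bw: float
    network_bw: float


def ideal_phase_time(phase: Phase, node: Node) -> float:
    # Assumption: reads and writes share a single disk, so their times add.
    disk_time = (phase.disk_read_bytes / node.disk_read_bw
                 + phase.disk_write_bytes / node.disk_write_bw)
    # Assumption: network transfer overlaps perfectly with disk activity,
    # so the slower of the two resources bounds the phase.
    network_time = phase.network_bytes / node.network_bw
    return max(disk_time, network_time)


def ideal_runtime(phases: Iterable[Phase], node: Node) -> float:
    # Assumption: phases run one after another with no overlap between them.
    return sum(ideal_phase_time(p, node) for p in phases)


def inefficiency(measured_seconds: float, ideal_seconds: float) -> float:
    # Fraction by which the measured runtime exceeds the ideal lower bound.
    return measured_seconds / ideal_seconds - 1.0


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the report.
    node = Node(disk_read_bw=100e6, disk_write_bw=100e6, network_bw=125e6)
    phases = [
        Phase(disk_read_bytes=1e9, network_bytes=0.9e9, disk_write_bytes=1e9),
        Phase(disk_read_bytes=1e9, network_bytes=0.0, disk_write_bytes=1e9),
    ]
    ideal = ideal_runtime(phases, node)
    print(f"ideal lower bound: {ideal:.1f} s")
    print(f"inefficiency at 48 s measured: {inefficiency(48.0, ideal):.1%}")
```

Under these assumptions, a measured runtime is judged against what the disks and network could deliver at best, which is the kind of comparison the abstract describes when it reports a gap of up to 20%.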

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2012
Accession Number
ADA628031

Entities

People

  • Elie Krevat
  • Eric Anderson
  • Gregory R. Ganger
  • Jay J. Wylie
  • Joseph Tucek
  • Tomer Shiran

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Biomedical
  • C4I

DTIC Thesaurus Topics

  • Algorithms
  • Cloud Computing
  • Commodities
  • Computations
  • Computer Programming
  • Computers
  • Data Analysis
  • Data Centers
  • Data Processing
  • Fault Tolerance
  • Health Care
  • Information Science
  • Measurement
  • Operating Systems
  • Scheduling (Production)
  • Throughput
  • Workload

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Distributed Systems and Data Platform Development
  • Parallel and Distributed Computing