Design Insights for MapReduce from Diverse Production Workloads

Abstract

In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include: input data forms up to 77% of all bytes; 90% of jobs access KB- to GB-sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.
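
Two of the metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not define either metric precisely, so the definitions here are assumptions: the peak-to-median ratio is taken over hourly job submission counts, and task-seconds-per-byte is taken as total task time divided by total bytes moved. The function names and sample numbers are hypothetical and are not drawn from the traces.

```python
from statistics import median


def peak_to_median(jobs_per_hour):
    """Ratio of the busiest hour's job count to the median hour's count.

    Assumes hourly buckets; the report may use a different granularity.
    """
    med = median(jobs_per_hour)
    if med == 0:
        raise ValueError("median submission rate is zero; ratio is undefined")
    return max(jobs_per_hour) / med


def task_seconds_per_byte(task_seconds, bytes_moved):
    """Total task time divided by total bytes moved (assumed definition)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return task_seconds / bytes_moved


# Hypothetical hourly submission counts with one burst hour.
hourly_counts = [10, 12, 9, 11, 90, 10, 8]
ratio = peak_to_median(hourly_counts)

# Hypothetical job: 3600 task-seconds over 1 GB (10**9 bytes) of data.
tspb = task_seconds_per_byte(3600.0, 10**9)

print(f"peak-to-median ratio: {ratio:.1f}:1")
print(f"task-seconds per byte: {tspb:.2e}")
```

Under these assumptions, a workload with a high task-seconds-per-byte value leans compute-heavy, while a low value suggests that data bandwidth is the scarcer resource.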

Document Details

Document Type
Technical Report
Publication Date
Jan 25, 2012
Accession Number
ADA555881

Entities

People

  • Randy H. Katz
  • Sara Alspaugh
  • Yanpei Chen

Organizations

  • University of California, Berkeley

Tags

Communities of Interest

  • Engineered Resilient Systems

DTIC Thesaurus Topics

  • Commerce
  • Communication Systems
  • Computations
  • Computer Science
  • Data Analysis
  • Data Centers
  • Data Sets
  • Electrical Engineering
  • Electronic Commerce
  • Frequency
  • Measurement
  • Observation
  • Production
  • Scheduling (Production)
  • Standards
  • Storage
  • Workload

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Mathematics or Statistics
  • Parallel and Distributed Computing