Interactive Query Processing in Big Data Systems: A Cross Industry Study of MapReduce Workloads

Abstract

Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short and increasingly interactive jobs in addition to the large long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing (e.g. OLAP) is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of SQL-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.

Document Details

Document Type: Technical Report
Publication Date: Apr 02, 2012
Accession Number: ADA561769

Entities

People

  • Randy H. Katz
  • Sara Alspaugh
  • Yanpei Chen

Organizations

  • University of California, Berkeley

Tags

Communities of Interest

  • Energy and Power Technologies
  • Ground and Sea Platforms
  • Weapons Technologies

DTIC Thesaurus Topics

  • Big Data
  • Commerce
  • Computations
  • Computer Programming
  • Computer Science
  • Data Processing
  • Data Sets
  • Databases
  • Electrical Engineering
  • Electronic Commerce
  • Information Science
  • Measurement
  • Scheduling (Production)
  • Signal Processing
  • Storage
  • Urban Areas
  • Workload

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Distributed Systems and Data Platform Development
  • Economics