Workload-Driven Design and Evaluation of Large-Scale Data-Centric Systems

Abstract

Large-scale data-centric systems help organizations store, manipulate, and derive value from large volumes of data. They consist of distributed components spread across a scalable number of connected machines and involve complex software/hardware stacks with multiple semantic layers. These systems help organizations solve established problems involving large amounts of data, while catalyzing new, data-driven businesses such as search engines, social networks, and cloud computing and data storage service providers. The complexity, diversity, scale, and rapid evolution of large-scale data-centric systems make it challenging to develop intuition about these systems, gain operational experience, and improve performance. It is an important research problem to develop a method to design and evaluate such systems based on the empirical behavior of the targeted workloads. Using an unprecedented collection of nine industrial workload traces of business-critical large-scale data-centric systems, we develop a workload-driven design and evaluation method for these systems and apply the method to address previously unsolved design problems. Specifically, the dissertation contributes the following:

1. A conceptual framework of breaking down workloads for large-scale data-centric systems into data access patterns, computation patterns, and load arrival patterns.
2. A workload analysis and synthesis method that uses multi-dimensional, non-parametric statistics to extract insights and produce representative behavior.
3. Case studies of workload analysis for industrial deployments of MapReduce and enterprise network storage systems, two examples of large-scale data-centric systems.
4. Case studies of workload-driven design and evaluation of an energy-efficient MapReduce system and Internet datacenter network transport protocol pathologies, two research topics that require workload-specific insights to address.

Overall, the dissertation develops a more objective and systematic u

Document Details

Document Type: Technical Report
Publication Date: May 09, 2012
Accession Number: ADA561684

Entities

People

  • Yanpei Chen

Organizations

  • University of California, Berkeley

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Application Software
  • Commerce
  • Computer Programming
  • Computer Science
  • Computers
  • Data Analysis
  • Data Centers
  • Data Science
  • Data Storage Systems
  • Databases
  • Electrical Engineering
  • Energy Consumption
  • Information Science
  • Network Science
  • Operating Systems
  • Parallel Computing
  • Transport Protocols

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development