Shark: Fast Data Analysis Using Coarse-grained Distributed Memory

Abstract

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to the data. It scales to thousands of nodes in a fault-tolerant manner. On large datasets, Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop. This report gives a complete overview of the development of Shark, including design decisions, performance details, and comparisons with existing data warehousing solutions. It also demonstrates some of Shark's distinguishing features, including its in-memory columnar caching and its unified machine learning interface.

Document Details

Document Type
Technical Report
Publication Date
May 01, 2013
Accession Number
ADA577443

Entities

People

  • Clifford Engle

Organizations

  • University of California, Berkeley

Tags

Communities of Interest

  • Autonomy
  • Engineered Resilient Systems

DTIC Thesaurus Topics

  • Algorithms
  • Big Data
  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Data Analysis
  • Data Management
  • Data Sets
  • Data Warehousing
  • Databases
  • Information Science
  • Language
  • Learning
  • Machine Learning
  • Relational Database Management Systems
  • Relational Databases

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Distributed Systems and Data Platform Development
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML