Trading Freshness for Performance in Distributed Systems

Abstract

Many data management systems face a constant, high-throughput stream of updates. In some cases, these updates are generated externally: a data warehouse system must ingest a stream of external events and update its state. In other cases, they are generated by the application itself: large-scale machine learning frameworks maintain a global shared state, which is used to store the parameters of a statistical model. These parameters are constantly read and updated by the application. In many cases, there is a trade-off between the freshness of the data returned by read operations and the efficiency of updating and querying the data. For instance, batching many updates together will significantly improve the update throughput for most systems. However, batching introduces a delay between when an update is submitted and when it is available to queries. In this dissertation, I examine this trade-off in detail. I argue that systems should be designed so that the trade-off can be made by the application, not the data management system. Furthermore, this trade-off should be made at query time, on a per-query basis, not as a global configuration. To demonstrate this, I describe two novel systems. LazyBase is a data warehouse system originally designed to store metadata extracted from enterprise computer files, for the purposes of enterprise information management. It batches updates and processes them through a pipeline of transformations before applying them to the database, allowing it to achieve very high update throughput. The novel pipeline query mechanism in LazyBase allows applications to select their desired freshness at query time, potentially reading data that is still in the update pipeline and has not yet been applied to the final database. LazyTables is a distributed machine learning parameter server: a shared storage system for the sparse vectors and matrices that make up the bulk of the data in many machine learning applications.
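The trade-off described in the abstract can be illustrated with a minimal sketch. This is not LazyBase's or LazyTables' actual design or API; the class `BatchedStore`, its `batch_size` parameter, and the `fresh` flag are all hypothetical names chosen for illustration. The sketch assumes a single-process key-value store in which updates are buffered and applied in batches, and each read chooses for itself whether to pay the extra cost of consulting updates that have not yet been applied.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class BatchedStore:
    """Toy key-value store that batches updates (hypothetical, not LazyBase's API)."""

    batch_size: int = 3
    # State that has been fully applied and is cheap to query.
    applied: Dict[str, Any] = field(default_factory=dict)
    # Updates that were submitted but are still waiting to be applied.
    pending: List[Tuple[str, Any]] = field(default_factory=list)

    def update(self, key: str, value: Any) -> None:
        # Buffer the update; apply the whole batch once the buffer is full.
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Apply all pending updates in submission order, then empty the buffer.
        for key, value in self.pending:
            self.applied[key] = value
        self.pending.clear()

    def read(self, key: str, fresh: bool = False) -> Optional[Any]:
        # fresh=False: read only applied state (fast, possibly stale).
        # fresh=True: also scan pending updates, newest first (slower, up to date).
        if fresh:
            for k, v in reversed(self.pending):
                if k == key:
                    return v
        return self.applied.get(key)


store = BatchedStore(batch_size=3)
store.update("x", 1)
stale = store.read("x")              # update is still pending, so not visible
fresh = store.read("x", fresh=True)  # the same read, but consulting pending updates
store.update("y", 2)
store.update("x", 3)                 # third update fills the batch and triggers flush
after = store.read("x")              # now visible without asking for freshness
```

In this sketch the freshness choice is an argument to each `read` call, not a store-wide setting, which mirrors the abstract's argument that the trade-off belongs to the application on a per-query basis.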

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 2014
Accession Number
ADA623640

Entities

People

  • James Cipar

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • Engineered Resilient Systems

DTIC Thesaurus Topics

  • Big Data
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Data Analysis
  • Data Management
  • Data Mining
  • Data Storage Systems
  • Hash Tables
  • Information Science
  • Information Systems
  • Machine Learning
  • Operating Systems
  • Parallel Computing
  • Social Media
  • Trees (Data Structures)

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks