Checkpointing and Rollback-Recovery for Distributed Systems

Abstract

We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to restore the system to a consistent state. In contrast to previous algorithms, ours tolerate failures that occur during their execution. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process restarts after a failure, a minimal number of additional processes are forced to restart with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
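
The abstract states the two-checkpoint storage bound but not the mechanism behind it. The Python sketch below illustrates one way such a bound can hold, assuming a two-phase scheme in which each process keeps a committed ("permanent") checkpoint and, while a round is in progress, one uncommitted ("tentative") checkpoint. The names Process, coordinated_checkpoint, and can_checkpoint are hypothetical and chosen for illustration only. The sketch does not model message passing, nor the dependency tracking that would limit a round to a minimal set of processes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Hypothetical process holding at most two checkpoints in stable storage."""
    name: str
    state: int = 0
    permanent: Optional[int] = None  # last committed checkpoint
    tentative: Optional[int] = None  # checkpoint awaiting a commit decision
    can_checkpoint: bool = True      # set to False to simulate a failure

    def take_tentative(self) -> bool:
        # Phase 1: save the current state without discarding the old checkpoint.
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def commit(self) -> None:
        # Phase 2 (success): the tentative checkpoint replaces the permanent one.
        if self.tentative is not None:
            self.permanent = self.tentative
            self.tentative = None

    def abort(self) -> None:
        # Phase 2 (failure): drop the tentative checkpoint, keep the old one.
        self.tentative = None

    def rollback(self) -> None:
        # Recovery: restart from the last committed checkpoint, if any.
        if self.permanent is not None:
            self.state = self.permanent

    def stored_checkpoints(self) -> int:
        # Number of checkpoints currently held (never more than two).
        return sum(c is not None for c in (self.permanent, self.tentative))


def coordinated_checkpoint(participants: List[Process]) -> bool:
    """Commit everywhere only if every participant saved a tentative checkpoint."""
    # Build a list first so every participant attempts phase 1.
    ok = all([p.take_tentative() for p in participants])
    for p in participants:
        if ok:
            p.commit()
        else:
            p.abort()
    return ok
```

In this sketch a failed round leaves every process with its previous permanent checkpoint intact, so a later rollback still reaches a state that was committed by all participants together.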

Document Details

Document Type: Technical Report
Publication Date: Oct 01, 1985
Accession Number: ADA161126

Entities

People

  • Richard Koo
  • Sam Toueg

Organizations

  • Cornell University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Communication Networks
  • Computations
  • Computer Networks
  • Computer Science
  • Computers
  • Consistency
  • Contrast
  • Databases
  • Determinants (Mathematics)
  • Explosives Initiators
  • Fault Tolerance
  • Fault Tolerant Computing
  • Operating Systems
  • Recovery
  • Reliability
  • Software Development

Fields of Study

  • Computer science
  • Engineering

Readers

  • Mathematical Modeling and Probability Theory
  • Parallel and Distributed Computing