Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs

Abstract

Existing rollback-recovery methods using consistent checkpointing may cause high overhead for applications that frequently send output to the 'outside world,' since a new consistent checkpoint must be written before the output can be committed, whereas existing methods using optimistic message logging may cause large delays in committing output, since processes may buffer received messages arbitrarily long before logging and may also delay propagating knowledge of their logging or checkpointing progress to other processes. This paper describes a new transparent rollback-recovery method that adds very little overhead to distributed application programs and efficiently supports the quick commit of all output to the outside world. Each process can independently choose at any time either to use checkpointing alone (as in consistent checkpointing) or to use optimistic message logging. The system is based on a new commit algorithm that requires communication with and information about the minimum number of other processes in the system, and supports the recovery of both deterministic and nondeterministic processes....

Keywords: Distributed systems, Fault tolerance, Rollback recovery, Optimistic message logging, Checkpointing, Output commit.

Document Details

Document Type: Technical Report
Publication Date: Mar 01, 1993
Accession Number: ADA268981

Entities

People

  • David B. Johnson

Organizations

  • Carnegie Mellon University

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Application Software
  • Computations
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Damage Detection
  • Distributed Computing
  • Fault Tolerance
  • Fault Tolerant Computing
  • Information Processing
  • Message Systems
  • Networks
  • Operating Systems
  • Software Development

Fields of Study

  • Computer science

Readers

  • Parallel and Distributed Computing.