Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs
Abstract
Existing rollback-recovery methods using consistent checkpointing may cause high overhead for applications that frequently send output to the "outside world," since a new consistent checkpoint must be written before the output can be committed. Existing methods using optimistic message logging, in contrast, may cause large delays in committing output, since processes may buffer received messages arbitrarily long before logging them and may also delay propagating knowledge of their logging or checkpointing progress to other processes. This paper describes a new transparent rollback-recovery method that adds very little overhead to distributed application programs and efficiently supports the quick commit of all output to the outside world. Each process can independently choose at any time either to use checkpointing alone (as in consistent checkpointing) or to use optimistic message logging. The system is based on a new commit algorithm that requires communication with, and information about, the minimum number of other processes in the system, and it supports the recovery of both deterministic and nondeterministic processes....
Keywords: Distributed systems, Fault tolerance, Rollback recovery, Optimistic message logging, Checkpointing, Output commit.
Document Details
- Document Type: Technical Report
- Publication Date: Mar 01, 1993
- Accession Number: ADA268981
Entities
People
- David B. Johnson
Organizations
- Carnegie Mellon University