Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing

Abstract

In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. The maximum recoverable system state never decreases, and if all messages are eventually logged, the domino effect cannot occur. This paper presents a general model for reasoning about recovery in such a system and, based on this model, an efficient algorithm for determining the maximum recoverable system state at any time. This work unifies existing approaches to fault tolerance based on message logging and checkpointing, and improves on existing methods for optimistic recovery in distributed systems.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
May 01, 1988
Accession Number
ADA222559

Entities

People

  • David B. Johnson
  • Willy Zwaenepoel

Organizations

  • Rice University

Tags

DTIC Thesaurus Topics

  • Algorithms
  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Consistency
  • Data Storage Systems
  • Fault Tolerance
  • Fault Tolerant Computing
  • Intervals
  • Iterations
  • Message Systems
  • Military Research
  • Operating Systems
  • Recovery
  • Software Development

Fields of Study

  • Computer science

Readers

  • Artificial Intelligence
  • Database Systems and Applications
  • Materials Science and Engineering.

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • AI & ML - Machine Learning Algorithms