Distributed System Fault Tolerance Using Message Logging and Checkpointing

Abstract

Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state.
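
The dependency-vector and dependency-matrix representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the definitions, so everything below is an assumption made for illustration: row i of a matrix is taken to be the dependency vector of process i, entry [i][j] is the highest state interval index of process j on which process i depends, -1 stands for "no dependency", a system state is treated as consistent when no process depends on a state interval that another process has not yet reached, and the lattice join takes each process's row from whichever state has advanced that process further. The function names `is_consistent` and `join` are hypothetical and are not taken from the report.

```python
# Illustrative sketch only; definitions are assumptions, not quoted from the report.

NO_DEP = -1  # assumed marker for "no dependency on that process"


def is_consistent(matrix):
    """Return True if no process depends on a state interval of process i
    that is later than process i's own current interval (the diagonal)."""
    n = len(matrix)
    return all(matrix[j][i] <= matrix[i][i]
               for i in range(n) for j in range(n))


def join(a, b):
    """Assumed lattice join: for each process, keep the row from whichever
    system state has that process in the later state interval."""
    return [list(a[i]) if a[i][i] >= b[i][i] else list(b[i])
            for i in range(len(a))]


# Two processes, P0 and P1.
# State A: P0 is in interval 1, P1 is in interval 0, no cross dependencies.
state_a = [[1, NO_DEP],
           [NO_DEP, 0]]

# State B: P0 is in interval 2; P1 is in interval 1 and depends on
# interval 2 of P0 (for example, after receiving a message sent there).
state_b = [[2, NO_DEP],
           [2, 1]]

# A mixed state: P0 still in interval 1, but P1 already depends on
# interval 2 of P0. Under the assumed rule this is inconsistent.
mixed = [state_a[0], state_b[1]]

print(is_consistent(state_a), is_consistent(state_b), is_consistent(mixed))
print(join(state_a, state_b))
```

Under these assumptions, the mixed state fails the check because P1 has recorded a dependency on a state interval that P0 has not reached, while the join of the two consistent states is again consistent, which is the kind of closure property a sublattice requires.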

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 1989
Accession Number
ADA222075

Entities

People

  • David B. Johnson

Organizations

  • Rice University

Tags

Communities of Interest

  • Energy and Power Technologies
  • Space

DTIC Thesaurus Topics

  • Algorithms
  • Application Software
  • Computer Networks
  • Computer Programming
  • Computer Science
  • Computers
  • Data Storage Systems
  • Data Transmission
  • Digital Communications
  • Distributed Computing
  • Fault Tolerance
  • Fault Tolerant Computing
  • Network Protocols
  • Operating Systems
  • Parallel Computing
  • Servers (Computer Hardware)
  • Software Development

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML