Distributed System Fault Tolerance Using Message Logging and Checkpointing

Abstract

Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state.
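
The dependency-matrix representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything below is an assumption made for illustration: row i of the matrix is taken to be the dependency vector of process i, entry [i][j] is taken to be the highest state interval index of process j that process i's state depends on (None if there is no dependency), and a system state is treated as consistent when no process depends on a state interval of another process later than the one that process has in the same system state. The names `is_consistent` and `join` are hypothetical and are not taken from the thesis.

```python
from typing import List, Optional

# Assumed representation (not from the source): matrix[i] is the dependency
# vector of process i; matrix[i][j] is the highest state interval index of
# process j that process i depends on, or None when there is no dependency.
# matrix[i][i] is process i's own current state interval index.
Matrix = List[List[Optional[int]]]


def is_consistent(matrix: Matrix) -> bool:
    """Return True if no process depends on a state interval of process i
    that is later than process i's own interval in this system state."""
    n = len(matrix)
    for i in range(n):
        own = matrix[i][i]
        for j in range(n):
            dep = matrix[j][i]
            if dep is not None and dep > own:
                return False
    return True


def join(a: Matrix, b: Matrix) -> Matrix:
    """Combine two system states by taking, for each process, the row from
    whichever state has that process in the later state interval."""
    return [a[i] if a[i][i] >= b[i][i] else b[i] for i in range(len(a))]


# Process 1 depends on interval 1 of process 0, which is now in interval 2.
ok_state = [[2, None], [1, 3]]
# Process 1 depends on interval 1 of process 0, which is only in interval 0.
bad_state = [[0, None], [1, 3]]
```

Under these assumptions, `is_consistent(ok_state)` holds and `is_consistent(bad_state)` does not. The `join` helper hints at why the set of consistent system states can have a unique maximum: combining two consistent states row by row yields another consistent state in this simplified model.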

Document Details

Document Type: Technical Report
Publication Date: Dec 01, 1989
Accession Number: ADA222075

Entities

People

David B. Johnson

Organizations

Rice University
