Formal Techniques for Fault-Tolerance in Distributed Data Processing (DDP).
Abstract
Previous work on the logical specification of distributed systems has been continued. The basic approach involves specifying the logical functioning of the system in terms of an abstract state machine. Part of the fault-tolerance of the system is achieved by the reliability of this state machine. However, the problem of faulty inputs must be explicitly handled by the state machine. A fault-tolerant DDP system must handle malfunctioning components that give conflicting information to different parts of the system. This problem is expressed abstractly as the 'Byzantine Generals Problem', involving a group of generals communicating by messenger who must reach agreement upon a common battle plan despite the fact that one or more of them may be traitors trying to confuse the others. Several solutions are described and their use in implementing fault-tolerant DDP systems is discussed. Achieving fault-tolerance requires the detection and diagnosis of faults. Our work on the diagnosis of faulty nodes and communication links in a DDP system is described. In the event of a failure, one would like a DDP system to recover rapidly with a minimal loss of information. The value of a hierarchical design methodology in achieving a recovery capability is discussed.
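The abstract does not say which solutions the report describes. As an illustration only, the following is a minimal sketch of one well-known approach to the Byzantine Generals Problem: a recursive majority-vote scheme in the style of the oral-messages algorithm, which tolerates m traitors among more than 3m generals. The names `om`, `send`, `majority`, and `DEFAULT`, and the particular traitor behavior, are assumptions made for this sketch and do not come from the report.

```python
from collections import Counter

DEFAULT = "retreat"  # assumed fallback order when votes are tied


def majority(values):
    """Return the most common value, or DEFAULT on a tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]


def send(sender, receiver, value, traitors):
    """Deliver a message; a traitor sends conflicting orders.

    Hypothetical traitor behavior: "attack" to even-numbered
    receivers, "retreat" to odd-numbered ones.
    """
    if sender in traitors:
        return "attack" if receiver % 2 == 0 else "retreat"
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run m rounds of recursive relaying; return {lieutenant: decision}."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received

    # Step 2: each lieutenant acts as commander for the others,
    # relaying the value it received with one fewer round.
    relayed = {}
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)

    # Step 3: each lieutenant takes the majority of its own value
    # and the values relayed to it by the other lieutenants.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions
```

With four generals and one traitor, `om(1, 0, [1, 2, 3], "attack", traitors={3})` leaves the loyal lieutenants 1 and 2 agreeing on the loyal commander's order, and `om(1, 0, [1, 2, 3], "attack", traitors={0})` leaves all three lieutenants agreeing with each other even though the commander sent conflicting orders.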
Document Details
- Document Type: Technical Report
- Publication Date: May 01, 1980
- Accession Number: ADA084537
Entities
People
- Jack Goldberg
- Leslie Lamport
- Peter G. Neumann
- Robert E. Shostak
- William H. Kautz
Organizations
- SRI International