Abstractions for Fault Tolerance in Distributed Systems
Abstract
This paper describes abstractions that are useful in implementing fault-tolerant and distributed systems. The same abstractions serve both fault tolerance and distribution, supporting our belief that the two concerns are not separable. The abstractions we present have a somewhat different flavor from abstractions prevalent in sequential and concurrent programming, which encapsulate state information and/or provide operations to manipulate that state (e.g., a stack or a monitor). Our abstractions are best thought of as properties of protocols or control. Section 2 describes abstractions of processors that can fail, and Section 3 reviews some fundamentals for coping with failures. Section 4 discusses the state machine approach, a general way to construct fault-tolerant distributed computing services. The state machine approach motivates two abstractions: Agreement and Order. Section 5 discusses a second approach for constructing a fault-tolerant computing service, and this motivates two more abstractions: Failure Detection and Stable Storage. Implementing abstractions by exploiting hardware is discussed in Section 6. Section 7 discusses related work.
Keywords: Fault tolerance; Distributed computing; State machines; Failure detection; Fail-stop processor.
Document Details
- Document Type: Technical Report
- Publication Date: Apr 01, 1986
- Accession Number: ADA166549
Entities
People
- Fred B. Schneider
Organizations
- Cornell University