Unreliable Failure Detectors for Reliable Distributed Systems
Abstract
It is well-known that Consensus, a fundamental problem of fault- tolerant distributed computing, cannot be solved in asynchronous systems with crash failures. This impossibility result stems from the lack of reliable failure detection in such systems. To circumvent such impossibility results, we introduce the concept of unreliable failure detectors that can make mistakes, and study the problem of using them to solve Consensus. We characterize unreliable failure detectors by two types of properties: completeness and accuracy. Informally, completeness requires that the failure detector eventually suspects every process that actually crashes, while accuracy restricts the mistakes that it can make. We define a hierarchy of failure detectors based on the strength of their accuracy. We determine which failure detectors in this hierarchy can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We show that Consensus can be solved with weak failure detectors, i.e., failure detectors that make an infinite number of mistakes. This leads to the following question: What is the weakest failure detector for solving Consensus? In a companion paper, we show that OW, one of the failure detector that we consider here, is the weakest failure detector for solving Consensus in asynchronous systems. In this paper, we show that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems. Thus, all our results apply to Atomic Broadcast as well.
Document Details
- Document Type
- Technical Report
- Publication Date
- Aug 01, 1993
- Accession Number
- ADA269016
Entities
People
- Sam Toueg
- Tushar D. Chandra
Organizations
- Cornell University