Unreliable Failure Detectors for Reliable Distributed Systems

Abstract

It is well-known that Consensus, a fundamental problem of fault- tolerant distributed computing, cannot be solved in asynchronous systems with crash failures. This impossibility result stems from the lack of reliable failure detection in such systems. To circumvent such impossibility results, we introduce the concept of unreliable failure detectors that can make mistakes, and study the problem of using them to solve Consensus. We characterize unreliable failure detectors by two types of properties: completeness and accuracy. Informally, completeness requires that the failure detector eventually suspects every process that actually crashes, while accuracy restricts the mistakes that it can make. We define a hierarchy of failure detectors based on the strength of their accuracy. We determine which failure detectors in this hierarchy can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We show that Consensus can be solved with weak failure detectors, i.e., failure detectors that make an infinite number of mistakes. This leads to the following question: What is the weakest failure detector for solving Consensus? In a companion paper, we show that OW, one of the failure detector that we consider here, is the weakest failure detector for solving Consensus in asynchronous systems. In this paper, we show that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems. Thus, all our results apply to Atomic Broadcast as well.

Open PDF

Document Details

Document Type: Technical Report
Publication Date: Aug 01, 1993
Accession Number: ADA269016

Entities

People

Sam Toueg
Tushar D. Chandra

Organizations

Cornell University

Unreliable Failure Detectors for Reliable Distributed Systems

Abstract

Document Details

Entities

People

Organizations

Tags

DTIC Thesaurus Topics

Fields of Study

Readers