Unreliable Failure Detectors for Reliable Distributed Systems

Abstract

It is well-known that Consensus, a fundamental problem of fault- tolerant distributed computing, cannot be solved in asynchronous systems with crash failures. This impossibility result stems from the lack of reliable failure detection in such systems. To circumvent such impossibility results, we introduce the concept of unreliable failure detectors that can make mistakes, and study the problem of using them to solve Consensus. We characterize unreliable failure detectors by two types of properties: completeness and accuracy. Informally, completeness requires that the failure detector eventually suspects every process that actually crashes, while accuracy restricts the mistakes that it can make. We define a hierarchy of failure detectors based on the strength of their accuracy. We determine which failure detectors in this hierarchy can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We show that Consensus can be solved with weak failure detectors, i.e., failure detectors that make an infinite number of mistakes. This leads to the following question: What is the weakest failure detector for solving Consensus? In a companion paper, we show that OW, one of the failure detector that we consider here, is the weakest failure detector for solving Consensus in asynchronous systems. In this paper, we show that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems. Thus, all our results apply to Atomic Broadcast as well.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Aug 01, 1993
Accession Number
ADA269016

Entities

People

  • Sam Toueg
  • Tushar D. Chandra

Organizations

  • Cornell University

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Algorithms
  • Application Software
  • Asynchronous Systems
  • Computations
  • Computer Networks
  • Computer Science
  • Computers
  • Consensus Algorithms
  • Damage Detection
  • Detection
  • Detectors
  • Distributed Computing
  • Fault Tolerance
  • Fault Tolerant Computing
  • Hierarchies
  • Operating Systems

Fields of Study

  • Engineering

Readers

  • Applied Combinatorial Optimization and Logic Circuit Design.
  • Theoretical Analysis.