Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs

Abstract

The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can also be applied to parallel computations. In this paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique--passive replication--is explored in detail, as it forms the basis for fault tolerance in the Paralex parallel programming environment.

Document Details

Document Type
Technical Report
Publication Date
Dec 07, 1991
Accession Number
ADA245510

Entities

People

  • Ozalp Babaoglu

Organizations

  • Cornell University

Tags

DTIC Thesaurus Topics

  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Contracts
  • Demographic Cohorts
  • Distributed Computing
  • European Communities
  • Failure Mode And Effect Analysis
  • Fault Tolerance
  • Notation
  • Parallel Computing
  • Parallel Processing
  • Reliability
  • Software Development
  • United States

Fields of Study

  • Computer science
  • Engineering

Readers

  • Fault Tolerant Diagnosis of Black and White Balloon Isolation Tests Using ¥.
  • Parallel and Distributed Computing.
  • Systems Analysis and Design