High-Level Fault Tolerance in Distributed Programs

Abstract

We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. There are several levels of programming abstraction at which fault tolerance mechanisms can be designed: high-level, where the checkpoint and restart are built into our C++ objects, but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where an interrupt periodically causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even in the case of frequent failures.
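
To make the "high-level" option concrete, the following is a minimal sketch of checkpoint and restart built into a C++ object. It is not Dome's API: `CheckpointableVector`, `save`, `restore`, `step`, `run`, and the text file layout are hypothetical names and choices made for illustration, and PVM distribution is left out entirely. The sketch assumes the constrained program structure the abstract mentions, here a single main loop whose iteration boundaries are the only points where state is saved. Values are written as text rather than as a memory image so that a checkpoint could, in principle, be read back on a different architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a data-parallel object; not Dome's actual class.
class CheckpointableVector {
public:
    explicit CheckpointableVector(std::size_t n) : data_(n, 0.0) {}

    // One unit of "work": add 1.0 to every element.
    void step() { for (double& v : data_) v += 1.0; }

    double at(std::size_t i) const { return data_[i]; }

    // Write the next iteration number and all values as text, then rename
    // the temporary file over the old checkpoint. On POSIX systems the
    // rename replaces the target, so a crash mid-write leaves the previous
    // checkpoint usable (an assumption about the platform).
    bool save(const std::string& path, int next_iter) const {
        const std::string tmp = path + ".tmp";
        {
            std::ofstream out(tmp);
            if (!out) return false;
            out.precision(17);  // enough digits to round-trip a double
            out << next_iter << ' ' << data_.size() << '\n';
            for (double v : data_) out << v << '\n';
            if (!out) return false;
        }
        return std::rename(tmp.c_str(), path.c_str()) == 0;
    }

    // Load a checkpoint if one exists and matches this object's size.
    // Returns the iteration to resume from, or 0 to start from scratch.
    int restore(const std::string& path) {
        std::ifstream in(path);
        int next_iter = 0;
        std::size_t n = 0;
        if (!(in >> next_iter >> n) || n != data_.size()) return 0;
        std::vector<double> loaded(n);
        for (double& v : loaded) {
            if (!(in >> v)) return 0;
        }
        data_ = loaded;
        return next_iter;
    }

private:
    std::vector<double> data_;
};

// Run iterations up to `total`, checkpointing every `every` iterations.
// A non-negative `fail_at` simulates a failure by returning just before
// that iteration. Returns the iteration at which the run stopped.
int run(CheckpointableVector& v, const std::string& path,
        int total, int every, int fail_at) {
    assert(every > 0);
    int i = v.restore(path);  // resumes at the saved iteration, or 0
    for (; i < total; ++i) {
        if (i == fail_at) return i;  // simulated failure
        v.step();
        if ((i + 1) % every == 0) v.save(path, i + 1);
    }
    return i;
}
```

In this sketch, a first call to `run` that "fails" partway through leaves a checkpoint behind, and a second call with a fresh object of the same size picks up from the last saved iteration rather than from zero. The work done between the last checkpoint and the failure is repeated, which is the usual trade-off between checkpoint frequency and lost work.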

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 1994
Accession Number
ADA290430

Entities

People

  • Adam Beguelin
  • Erik Seligman

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Damage Detection
  • Data Sets
  • Environment
  • Failure Mode And Effect Analysis
  • Fault Tolerance
  • High Performance Computing
  • Intervals
  • Iterations
  • Molecular Dynamics
  • Operating Systems
  • Preprocessing
  • Simulations
  • Time Intervals

Fields of Study

  • Computer science
  • Engineering

Readers

  • Parallel and Distributed Computing