Application Level Fault Tolerance in Heterogeneous Networks of Workstations.

Abstract

We have explored methods for checkpointing and restarting processes within the Distributed Object Migration Environment (Dome), a C++ library of data-parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System-level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application-level checkpointing, which places the checkpoint and restart mechanisms within Dome's C++ objects. Application-level checkpointing has been implemented in two forms: a library-based technique used directly by the programmer and a more transparent preprocessor-based technique. Dome's implementation successfully checkpoints and restarts processes on different numbers of machines and on different architectures. Experimental results from executing Dome programs across a NOW with realistic failure rates are compared with theoretical results. Checkpointing overhead is found to be low, and checkpointing substantially decreases expected runtime on realistic systems.
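
As an illustration of the application-level approach described in the abstract, the following is a minimal C++ sketch and not Dome's actual API. The class `CheckpointableVector` and the functions `do_step` and `run` are hypothetical names introduced here. The sketch assumes that decimal text is an acceptable architecture-independent checkpoint format, and it uses an in-memory string in place of a checkpoint file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a data-parallel object; NOT Dome's actual API.
class CheckpointableVector {
public:
    explicit CheckpointableVector(std::size_t n = 0, std::int64_t init = 0)
        : data_(n, init) {}

    std::size_t size() const { return data_.size(); }
    std::int64_t& operator[](std::size_t i) { return data_[i]; }
    std::int64_t operator[](std::size_t i) const { return data_[i]; }

    // Assumption: state is written as decimal text, which does not depend
    // on the byte order or native word size of the writing machine.
    void checkpoint(std::ostream& out) const {
        out << data_.size();
        for (std::int64_t v : data_) out << ' ' << v;
        out << '\n';
    }

    // Rebuild the object from a checkpoint; the old state is replaced only
    // if the whole checkpoint parses.
    void restore(std::istream& in) {
        std::size_t n = 0;
        if (!(in >> n)) throw std::runtime_error("bad checkpoint header");
        std::vector<std::int64_t> tmp(n);
        for (auto& v : tmp)
            if (!(in >> v)) throw std::runtime_error("truncated checkpoint");
        data_.swap(tmp);
    }

private:
    std::vector<std::int64_t> data_;
};

// One iteration of a toy computation: add the step number to every element.
void do_step(CheckpointableVector& v, int step) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] += step;
}

// Runs `steps` iterations, checkpointing every `interval` steps. If
// fail_at >= 0, the in-memory state is discarded once at that step and
// rebuilt from the most recent checkpoint. Returns the sum of all elements.
std::int64_t run(int steps, int interval, int fail_at) {
    CheckpointableVector v(4, 0);
    std::string saved;  // stands in for a checkpoint file
    int step = 0;
    bool failed = false;

    auto save = [&] {
        std::ostringstream out;
        out << step << '\n';  // program progress is saved with the data
        v.checkpoint(out);
        saved = out.str();
    };

    save();
    while (step < steps) {
        if (!failed && step == fail_at) {
            failed = true;
            v = CheckpointableVector();  // simulate losing the process
            std::istringstream in(saved);
            in >> step;                  // resume from the saved step
            v.restore(in);
            assert(v.size() == 4);
            continue;
        }
        do_step(v, step);
        ++step;
        if (step % interval == 0) save();
    }

    std::int64_t sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i) sum += v[i];
    return sum;
}
```

Under these assumptions, `run(10, 3, 7)` loses its state at step 7, resumes from the checkpoint taken at step 6, and returns the same result as the failure-free `run(10, 3, -1)`. The work redone between the last checkpoint and the failure is the cost that the choice of checkpoint interval trades against checkpointing overhead.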

Document Details

Document Type
Technical Report
Publication Date
Aug 01, 1996
Accession Number
ADA319154

Entities

People

  • Adam Beguelin
  • Erik Seligman
  • Peter Stephan

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Damage Detection
  • Dynamics
  • Environment
  • Failure Mode And Effect Analysis
  • Fault Tolerance
  • Intervals
  • Linear Algebra
  • Lists (Data Structures)
  • Models
  • Molecular Dynamics
  • Preprocessing
  • Time Intervals
  • Virtual Machines

Fields of Study

  • Computer science
  • Engineering

Readers

  • Parallel and Distributed Computing