Scheduling Message Processing for Reducing Rollback Propagation

Abstract

Traditional checkpointing and rollback recovery techniques for parallel systems have typically assumed the communication pattern is specified by program behavior. In this paper we exploit the property that the communication pattern can often be changed at runtime without affecting program correctness. A scheduling algorithm for message processing and its implementation for reducing rollback propagation are described. The algorithm incorporates a user-transparent prioritized scheme based upon the run-time communication and checkpointing history. Communication trace-driven simulation for several parallel programs written in the Chare Kernel language demonstrates that the probability of rollback propagation can be reduced at the cost of slight additional performance degradation.
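
The idea of reordering message processing by communication and checkpointing history can be illustrated with a small sketch. This is not the algorithm from the report, whose details are not given in this record. It assumes one plausible priority rule: a pending message is "safe" if its sender is known to have taken a checkpoint after sending it, because a later rollback of the sender would not undo that send. Safe messages are processed before "risky" ones, and ties are broken in arrival order. All names here (Message, MessageScheduler, send_interval, record_checkpoint) are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Message:
    sender: int          # id of the sending process
    send_interval: int   # sender's checkpoint interval index at send time (assumed to be piggybacked)
    payload: str


class MessageScheduler:
    """Illustrative receiver-side scheduler; the priority rule is an assumption."""

    def __init__(self):
        self._latest_interval = {}  # sender id -> newest checkpoint interval known to this receiver
        self._pending = []          # list of (arrival sequence number, Message)
        self._seq = count()

    def record_checkpoint(self, sender, interval):
        # Note that `sender` has checkpointed and is now in `interval`.
        current = self._latest_interval.get(sender, 0)
        self._latest_interval[sender] = max(current, interval)

    def enqueue(self, msg):
        self._pending.append((next(self._seq), msg))

    def _is_risky(self, msg):
        # Risky: the send is not yet covered by a sender checkpoint, so
        # processing it would make this receiver depend on the sender's
        # current, still volatile interval.
        return msg.send_interval >= self._latest_interval.get(msg.sender, 0)

    def next_message(self):
        # Priorities are evaluated at dequeue time because checkpoint
        # history may have changed since the message arrived.
        if not self._pending:
            return None
        best = min(self._pending, key=lambda e: (self._is_risky(e[1]), e[0]))
        self._pending.remove(best)
        return best[1]


sched = MessageScheduler()
sched.enqueue(Message(sender=1, send_interval=0, payload="a"))
sched.enqueue(Message(sender=2, send_interval=0, payload="b"))
sched.record_checkpoint(sender=2, interval=1)  # sender 2 checkpointed after sending "b"
order = [sched.next_message().payload, sched.next_message().payload]
```

In this usage, "b" is taken first even though it arrived second, since its send is already covered by a checkpoint. Reordering is only acceptable when, as the abstract notes, the program remains correct under a different processing order.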

Document Details

Document Type
Technical Report
Publication Date
Jul 01, 1992
Accession Number
ADA251908

Entities

People

  • W. K. Fuchs
  • Yi Min Wang

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Computers
  • Degradation
  • Engineering
  • Fault Tolerance
  • Fault Tolerant Computing
  • High Performance Computing
  • Intervals
  • Message Processing
  • Operating Systems
  • Parallel Computing
  • Parallel Processing
  • Probability
  • Recovery
  • Scheduling (Production)
  • Simulations
  • Software Development

Fields of Study

  • Computer science
  • Engineering

Readers

  • Parallel and Distributed Computing