The Performance of Cache-Based Error Recovery in Multiprocessors

Abstract

Several variations of cache-based checkpointing for rollback error recovery in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead but uncontrollable high variability in the checkpoint interval....

Keywords: Fault-tolerant computing, Cache-based checkpointing and rollback recovery, Shared-memory multiprocessors, Trace-driven simulation.

Document Details

Document Type: Technical Report
Publication Date: Aug 25, 1992
Accession Number: ADA259426

Entities

People

  • Bob Janssens
  • W. K. Fuchs

Organizations

  • University of Illinois Urbana–Champaign

Tags

Communities of Interest

  • Space

DTIC Thesaurus Topics

  • Algorithms
  • Circuits
  • Computations
  • Computers
  • Fault Tolerant Computing
  • Frequency
  • High Performance Computing
  • Instructions
  • Intervals
  • Kilobytes
  • Multiprocessors
  • Parallel Computing
  • Parallel Processing
  • Parallel Processors
  • Probability
  • Simulations
  • Simulators

Fields of Study

  • Computer science
  • Engineering

Readers

  • Computational Modeling and Simulation
  • Parallel and Distributed Computing