Checkpointing and Rollback Recovery in Distributed Shared Memory Systems

Abstract

Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of the high frequency of communication. In this paper, we show that, because of information redundancy, not all message-passing dependences need to be considered to roll back to a consistent state in DSM systems, which reduces both dependency tracking overhead and the potential for rollback propagation. We develop a model of execution in which client processes running an application interact atomically with a set of shared-memory server processes on every access to shared data. We show that, under this model, dependences are significantly reduced relative to the message-passing model. We use results from simulation with multiprocessor address traces to demonstrate the reduction in dependences.
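
The rollback propagation mentioned in the abstract can be illustrated with a minimal sketch of conventional message-passing dependency tracking, the baseline the report aims to improve on. This is not the report's model or algorithm. The function name `recovery_line`, the tuple layout of a dependence, and the interval numbering are assumptions made for illustration, and only orphan messages are handled (lost messages are assumed to be covered by logging).

```python
def recovery_line(latest, deps, failed):
    """Compute how far each process must roll back after `failed` crashes.

    Hypothetical model, for illustration only:
      latest: dict mapping process -> index of its most recent checkpoint
      deps:   list of (sender, send_interval, receiver, recv_interval);
              interval k is the execution that follows checkpoint k
      failed: the process that crashed

    Returns a dict mapping process -> checkpoint index to restart from.
    A value of latest[p] + 1 means p keeps its current state.
    """
    # Start by assuming nobody rolls back, except the failed process,
    # which restarts from its most recent checkpoint.
    line = {p: k + 1 for p, k in latest.items()}
    line[failed] = latest[failed]

    changed = True
    while changed:  # iterate until no further rollback is forced
        changed = False
        for sender, s_int, receiver, r_int in deps:
            send_undone = s_int >= line[sender]
            recv_kept = r_int < line[receiver]
            if send_undone and recv_kept:
                # The receive would become an orphan, so the receiver
                # must roll back to the checkpoint that precedes it.
                line[receiver] = r_int
                changed = True
    return line


latest = {"A": 2, "B": 2, "C": 2}
deps = [("A", 2, "B", 1), ("B", 1, "C", 2), ("B", 0, "C", 0)]
result = recovery_line(latest, deps, "A")
print(result)
```

In this example the failure of A forces B back to checkpoint 1, which in turn forces C back to checkpoint 2. Every recorded dependence is a potential path for such propagation, which is why reducing the number of dependences that must be tracked also limits how far a rollback can spread.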

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 1994
Accession Number
ADA281583

Entities

People

  • Bob Janssens
  • W. Kent Fuchs

Organizations

  • University of Illinois Urbana–Champaign

Tags

Communities of Interest

  • Space

DTIC Thesaurus Topics

  • Algorithms
  • Application Software
  • Computer Architecture
  • Computers
  • Consistency
  • Data Transmission
  • Determinants (Mathematics)
  • Directories
  • Fault Tolerance
  • Fault Tolerant Computing
  • Frequency
  • Models
  • Multiprocessors
  • Operating Systems
  • Recovery
  • Redundancy
  • Simulations

Fields of Study

  • Computer science
  • Engineering

Readers

  • Brain and Cognitive Science; Experimental Psychology; Cognitive Neuroscience
  • Parallel and Distributed Computing