Checkpointing and Error Recovery in Distributed Systems

Abstract

This paper discusses some of the problems of producing fault-tolerant distributed computer systems, in particular those of software error recovery. It shows how checkpoints may be used in error recovery, defines the information that checkpoints must contain, and discusses alternative strategies for checkpointing. It describes models of error recovery and extends an existing recovery protocol to cater for certain types of checkpoint inconsistencies. The paper defines protocols for systematically generating checkpoints so that they can be used by the recovery protocols. It also defines a protocol for discarding checkpoints when they are no longer 'of use', which prevents the set of checkpoints from growing indefinitely. The paper concludes by considering some of the problems of implementing the protocols. (Author)

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 1980
Accession Number
ADA093463

Entities

People

  • J. A. McDermid

Organizations

  • Royal Signals and Radar Establishment

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Communication Systems
  • Computer Simulations
  • Computers
  • Data Transmission
  • Fail Safe
  • Fault Tolerance
  • High Level Languages
  • Language
  • Mainframe Computers
  • Maintenance
  • Models
  • Operating Systems
  • Packet Switching
  • Parallel Computing
  • Parallel Processing
  • Simulations

Fields of Study

  • Computer science

Readers

  • Approximation Theory
  • Oncology
  • Systems Analysis and Design