Checkpointing and Error Recovery in Distributed Systems
Abstract
This paper discusses some of the problems of producing fault-tolerant distributed computer systems, in particular those of software error recovery. It shows how checkpoints may be used in error recovery, defines the information that checkpoints must contain, and discusses alternative strategies for checkpointing. It describes models of error recovery and extends an existing recovery protocol to cater for certain types of checkpoint inconsistencies. The paper defines protocols for systematically generating checkpoints so that they can be used by the recovery protocols. It also defines a protocol for discarding checkpoints when they are no longer 'of use', which prevents the set of checkpoints from growing indefinitely. The paper concludes by considering some of the problems of implementing the protocols. (Author)
Document Details
- Document Type: Technical Report
- Publication Date: Sep 01, 1980
- Accession Number: ADA093463
Entities
People
- J. A. McDermid
Organizations
- Royal Signals and Radar Establishment