Optimistic Execution and Checkpoint Comparison for Error Recovery in Parallel and Distributed Systems
Abstract
This paper describes a checkpoint comparison and optimistic execution technique for error detection and recovery in distributed and parallel systems. The approach is based on lookahead execution and rollback validation. It uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. Two schemes derived from this strategy are analyzed and compared with triplication and voting, and with two common backward recovery methods. The impact of checkpoint time, checkpoint validation time, and process restart time is also examined. An implementation on a Sun NFS network with six benchmark programs is presented. Compared with classic checkpointing and rollback techniques, our strategy provides rapid recovery and requires, on average, fewer processors than standard replication and voting methods. This strategy is useful in systems where spare processors are available at the time of recovery.
Keywords: fault-tolerant computing, checkpointing, error detection, error recovery.
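The combination of checkpoint comparison, lookahead execution, and rollback validation summarized in the abstract can be illustrated with a minimal sketch. This is not the report's implementation: the names `digest`, `run_interval`, and `advance` are hypothetical, the state is a plain integer, the replicas run sequentially rather than on separate processors, and the spare processor's re-execution is assumed to be fault-free.

```python
import hashlib
from typing import Callable, Optional, Tuple

State = int
Step = Callable[[State], State]


def digest(state: State) -> str:
    """Checkpoint signature used for comparison (assumed: SHA-256 of repr)."""
    return hashlib.sha256(repr(state).encode()).hexdigest()


def run_interval(state: State, step: Step, fault: Optional[Step] = None) -> State:
    """Run one checkpoint interval; `fault` optionally corrupts the result."""
    result = step(state)
    return fault(result) if fault else result


def advance(
    validated: State,
    step: Step,
    fault_a: Optional[Step] = None,
    fault_b: Optional[Step] = None,
) -> Tuple[State, int, str]:
    """Return (new state, intervals completed, outcome) for one round."""
    # Two replicas execute the same interval from the last validated checkpoint.
    a = run_interval(validated, step, fault_a)
    b = run_interval(validated, step, fault_b)

    # Error detection: compare the two checkpoints.
    if digest(a) == digest(b):
        return a, 1, "match"

    # Optimistic lookahead: continue from both unverified checkpoints.
    lookahead_a = step(a)
    lookahead_b = step(b)

    # Rollback validation: a spare processor re-executes the interval
    # (assumed fault-free in this sketch).
    reference = run_interval(validated, step)

    # Forward recovery: keep the lookahead that started from the correct checkpoint.
    if digest(reference) == digest(a):
        return lookahead_a, 2, "forward recovery from replica A"
    if digest(reference) == digest(b):
        return lookahead_b, 2, "forward recovery from replica B"

    # Neither replica agrees with the reference: fall back to rollback.
    return validated, 0, "rollback"
```

In this sketch a single faulty replica costs no lost progress, because the lookahead started from the correct checkpoint is retained; only when both replicas disagree with the re-execution does the round fall back to the last validated checkpoint.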
Document Details
- Document Type: Technical Report
- Publication Date: May 08, 1992
- Accession Number: ADA251925
Entities
People
- Jacob A. Abraham
- Junsheng Long
- W. Kent Fuchs
Organizations
- University of Illinois Urbana–Champaign