Coordinated Checkpointing Without Direct Coordination

Abstract

Coordinated checkpointing is a well-known method for achieving fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper, we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained, the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/s ATM network. Experimental results show that the protocol overheads are very small.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 1998
Accession Number
ADA348851

Entities

People

  • Nuno Neves
  • W. Kent Fuchs

Organizations

  • Purdue University

Tags

DTIC Thesaurus Topics

  • Algorithms
  • Communication Channels
  • Computers
  • Consistency
  • Electrical Engineering
  • Engineering
  • Genetic Algorithms
  • Guarantees
  • Intervals
  • Military Research
  • Networks
  • Operating Systems
  • Recovery
  • Scheduling (Production)
  • Sequences
  • Technical Information Centers
  • Transport Protocols

Fields of Study

  • Computer science

Readers

  • Parallel and Distributed Computing.