Application-Transparent Fault Management.

Abstract

As computers continue to proliferate and are used in more demanding environments, data integrity and continuous availability become increasingly important aspects of their design. Since operating systems are common to all computers, and the operating system level offers maximum system visibility and control, it is appropriate for the operating system to provide policies that detect, contain, and tolerate faults. These policies and the mechanisms that support them form an operating system's fault management. A fault management mechanism, the sentry mechanism, has been designed and implemented for a UNIX 4.3 BSD server running on the Mach 3.0 microkernel. Fault-tolerant policies have been designed for a range of computer systems, from a single computer to mirrored computers to distributed systems. The policies first addressed provide single-computer applications with application-transparent fault tolerance with respect to transient faults and certain types of permanent faults. Contributions to this area include algorithms for concurrent process journaling, disk checkpointing, and memory checkpointing. Formal proofs are given for the journal sequencing algorithm and the disk checkpointing algorithm. Performance measurements from an implementation of the single-computer algorithms show an average performance overhead of less than 5% and a requirement of only 10 MB of dedicated disk stable storage. The system provides fault tolerance with no additional hardware other than a hard disk, and works with unmodified applications such as the X Window System. Sentry policies that provide software-based fault tolerance for duplicated and triplicated computer systems, as well as distributed systems, have also been designed.

Document Details

Document Type
Technical Report
Publication Date
Aug 01, 1994
Accession Number
ADA299747

Entities

People

  • Mark E. Russinovich

Organizations

  • Carnegie Mellon University

Tags

DTIC Thesaurus Topics

  • Algorithms
  • Commerce
  • Communications Protocols
  • Computer Programming
  • Computer Programs
  • Computers
  • Computing System Architectures
  • Data Compression
  • Debugging
  • Detection
  • Device Drivers
  • Fault Tolerance
  • Hash Tables
  • Identification
  • Lists (Data Structures)
  • Operating Systems
  • Time Intervals

Fields of Study

  • Computer science
  • Engineering

Readers

  • Database Systems and Applications
  • Fault Tolerant Diagnosis of Black and White Balloon Isolation Tests Using ¥.