Health Maintenance System: An Application of Recovery Oriented Computing for HPEC Systems

Abstract

Until recently, the most critical aspect of HPEC systems has been "performance," in terms of processor speeds and I/O throughput. As processor speeds and I/O throughput have continued to increase, and as the capability to build larger and larger systems has improved, the need for raw performance is becoming less critical. Now, the ability to achieve a high level of application availability is becoming as critical as performance. In this paper, the author presents a CORBA-based framework upon which highly available applications can be constructed. This framework, known as the Health Maintenance System, provides the application, system managers, and management tools with the ability to "manage" all resources within a system such that the "health" of the system can be maintained. The management of these resources involves the ability to "sense" the state of the resource, to control the resource, and to run tests on the resource to proactively detect any latent problems. The primary facet of the framework is the "resource manager." The resource manager provides local management support for all system resources. In addition, the resource manager provides management access to clients (e.g., the application). This access is provided via a set of "client interface" modules that provide a wide variety of interfaces (e.g., APIs, agents, etc.). It is this combination of resource managers and client interface modules that allows the framework to be easily configured for a specific HPEC system. Ten briefing charts summarize the presentation.
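The sense/control/test split described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the actual framework is CORBA-based, whereas this example is plain in-process Python, and every name in it (Health, Resource, SimulatedLink, ResourceManager, ClientApi, and the "reset" and "disable" commands) is a hypothetical chosen for illustration.

```python
from abc import ABC, abstractmethod
from enum import Enum


class Health(Enum):
    """Hypothetical health states; the report does not define these."""
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


class Resource(ABC):
    """Hypothetical managed resource exposing sense, control, and test."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def sense(self):
        """Return the current Health of the resource."""

    @abstractmethod
    def control(self, command):
        """Apply a control command to the resource."""

    @abstractmethod
    def run_test(self):
        """Run a diagnostic; return True if no latent problem is found."""


class SimulatedLink(Resource):
    """Stand-in resource used only to exercise the sketch."""

    def __init__(self, name):
        super().__init__(name)
        self.enabled = True
        self.error_count = 0

    def sense(self):
        if not self.enabled:
            return Health.FAILED
        return Health.DEGRADED if self.error_count > 0 else Health.OK

    def control(self, command):
        if command == "reset":
            self.error_count = 0
            self.enabled = True
        elif command == "disable":
            self.enabled = False
        else:
            raise ValueError(f"unknown command: {command}")

    def run_test(self):
        return self.enabled and self.error_count == 0


class ResourceManager:
    """Local management point for a set of resources (assumed design)."""

    def __init__(self):
        self._resources = {}

    def register(self, resource):
        self._resources[resource.name] = resource

    def sense_all(self):
        return {name: r.sense() for name, r in self._resources.items()}

    def control(self, name, command):
        self._resources[name].control(command)

    def test_all(self):
        return {name: r.run_test() for name, r in self._resources.items()}


class ClientApi:
    """Hypothetical client interface module: a thin API over the manager."""

    def __init__(self, manager):
        self._manager = manager

    def unhealthy(self):
        states = self._manager.sense_all()
        return sorted(n for n, h in states.items() if h is not Health.OK)

    def recover(self, name):
        self._manager.control(name, "reset")
```

In this sketch a client such as the application would call ClientApi.unhealthy() to find resources needing attention and ClientApi.recover(name) to reset one, while the resource manager alone talks to the resources. Other client interface modules (an agent, for example) could wrap the same manager without changing it, which is the configurability the abstract attributes to the combination of resource managers and client interface modules.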

Document Details

Document Type
Technical Report
Publication Date
Aug 20, 2004
Accession Number
ADA428761

Entities

People

  • Gerry Pocock

Tags

DTIC Thesaurus Topics

  • Abstracts
  • Availability
  • Computers
  • Failure Mode And Effect Analysis
  • Information Operations
  • Instrumentation
  • Low Density
  • Maintenance
  • Middleware
  • Monitoring
  • Recovery
  • Redundancy
  • Reliability
  • Standards
  • Throughput

Fields of Study

  • Computer science

Readers

  • Database Systems and Applications
  • Distributed Systems and Data Platform Development
  • Logistics and Supply Chain Management