FAIL-SAFE: Fault Aware IntelLigent Software for Exascale

Abstract

The University of Southern California (USC), the Lawrence Livermore National Laboratory (LLNL), and the Jet Propulsion Laboratory (JPL) believe that a new generation of dependable applications must be developed to successfully exploit the next generation of technology. Such applications, and the systems they run on, must be introspective and adaptive, actively searching for errors in their program state with hardware mechanisms and new software techniques. Towards this end, we have developed technology to enable adaptive, application-oriented control of fault tolerance, and have demonstrated it for a set of scientific applications on a workstation-class system by injecting memory faults and observing the survivability of the applications. We have defined an assertion language that provides programmers with a convenient interface to specify the resilience characteristics of applications, and have implemented a limited set of these assertions as source-to-source transformations in the ROSE compiler infrastructure. The outcomes of this research provide a model for the vendors of Defense systems, and a prototype capability should the vendors choose not to bring such technology to market. The increased application resilience resulting from this research will lead to faster completion of Defense applications, and thus to substantial energy savings as well as increased mission assurance.

Document Details

Document Type
Technical Report
Publication Date
Jun 13, 2016
Accession Number
AD1024222

Entities

People

  • Robert F. Lucas

Organizations

  • University of Southern California

Tags

Communities of Interest

  • C4I
  • Energy and Power Technologies
  • Engineered Resilient Systems
  • Space

DTIC Thesaurus Topics

  • Algorithms
  • C Programming Language
  • Computer Programming
  • Computer Programs
  • Computers
  • Department Of Defense
  • Energy Consumption
  • Fail Safe
  • Fault Tolerance
  • High Performance Computing
  • Jet Propulsion
  • Language
  • Management Personnel
  • Parallel Computing
  • Parallel Processing
  • Programming Languages
  • Reliability

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Software Engineering
  • Systems Analysis and Design