FAIL-SAFE: Fault Aware IntelLigent Software for Exascale
Abstract
The University of Southern California (USC), the Lawrence Livermore National Laboratory (LLNL), and the Jet Propulsion Laboratory (JPL) believe that a new generation of dependable applications must be developed to successfully exploit the next generation of exascale computing technology. Such applications, and the systems they run on, must be introspective and adaptive, actively searching for errors in their program state with hardware mechanisms and new software techniques. Toward this end, we have developed and demonstrated technology that enables adaptive, application-oriented control of fault tolerance for a set of scientific applications on a workstation-class system, by injecting memory faults and observing the survivability of the applications. We have defined an assertion language that provides programmers with a convenient interface to specify the resilience characteristics of applications, and have implemented a limited set of these assertions as source-to-source transformations in the ROSE compiler infrastructure. The outcomes of this research provide a model for the vendors of Defense systems, and a prototype capability should the vendors choose not to bring such technology to market. The increased application resilience resulting from this research will lead to faster completion of Defense applications, and thus substantial energy savings as well as increased mission assurance.
Document Details
- Document Type: Technical Report
- Publication Date: Jun 13, 2016
- Accession Number: AD1024222
Entities
People
- Robert F. Lucas
Organizations
- University of Southern California