Failure Recovery in Resilient X10
Abstract
Cloud computing has made the resources needed to execute large-scale in-memory distributed computations widely available. Specialized programming models, e.g., MapReduce, have emerged to offer transparent fault tolerance and fault recovery for specific computational patterns, but they sacrifice generality. In contrast, the Resilient X10 programming language adds failure containment and failure awareness to a general-purpose distributed programming language. A Resilient X10 application spans a number of places, and the language's formal semantics precisely specify how the application continues executing after a place failure. Thanks to failure awareness, the X10 programmer can in principle build redundancy into an application to recover from failures. In practice, however, correctness is elusive, as redundancy and recovery are often complex programming tasks.
Document Details
- Document Type: Pub Defense Publication
- Publication Date: Jul 02, 2019
- Source ID: 10.1145/3332372
Entities
People
- Arun Iyengar
- Avraham Shinnar
- Benjamin Herta
- David Grove
- Josh Milthorpe
- Kiyokuni Kawachiya
- Mikio Takeuchi
- Olivier Tardieu
- Sara S. Hamouda
- Vijay Saraswat
Organizations
- Air Force Office of Scientific Research
- Australian National University
- Goldman Sachs
- International Business Machines Corporation (Armonk, NY)
- United States Department of Energy