Runtime Systems and Programming Tools for Resilient Computing with Extreme Heterogeneity

Abstract

Overview: The impending end of CMOS transistor scaling (Moore's Law) is fueling a great diversity in hardware technologies being explored for the post-Moore era, resulting in an "extreme heterogeneity" of computation and memory components that can be integrated in a single platform. Further, the failure rates for many of these components are likely to be extremely high, relative to current failure rates, especially with respect to soft/transient failures. These disruptive changes in computer hardware will require a fresh rethinking of software approaches to resilience and heterogeneous computing. Past work on software-based resilience techniques typically focused on a) homogeneous parallel architectures, b) recovery from hard failures, and c) the use of traditional bulk-synchronous parallel programming models. We propose a new approach to building runtime systems and programming tools that will instead support a) extreme heterogeneity, b) recovery from soft failures, and c) high levels of asynchrony in computation and data movement that are suitable for irregular algorithms used in data analytics (in contrast to bulk-synchronous parallelism).

Runtime System: The foundation of our approach to runtime systems is an execution model derived from extensions to asynchronous event-driven tasks and asynchronous communications based on extensions to the actor model. The asynchrony inherent in this model will enable the runtime system to dynamically rebalance resource allocations, and adapt to soft failures. While classical techniques can be leveraged in our runtime to enable recovery from hard failures, a key part of the innovation and technical challenges in the runtime system relates to recovery from soft errors.
In past work, we were among the first to show how work-stealing runtimes can be extended to support task parallelism on heterogeneous processors [LCTES'12], at that time for a combination of multicore CPU, GPU, and FPGA processors. Our past work on developing the Habanero-C/C++ library also showed how a unified runtime system can integrate extensions for different programming constructs, heterogeneity and distributed-memory parallelism [IPDPS'17]. Our recent work has demonstrated how our Habanero-C library can support resilience by enabling both task replay and task replication in a unified runtime [EuroPar'19].

Programming Tools: Our approach to programming tools is founded on the asynchronous resilient execution model that will be supported by our runtime system. In a recent workshop on future asynchronous-PGAS directions (held at LPS on 2/21/19), it was broadly observed that the use of event-driven callbacks was a critical capability for achieving scalable performance for irregular algorithms, including those involving sparse matrices and graphs. At the same time, it was also observed that programming with callbacks severely hampers productivity. In past work, we have shown how tasks and actors can be integrated so as to obtain the benefits of event-driven execution without the burden of writing code with explicit callbacks [OOPSLA'12]. We have also demonstrated an implementation of our extensions to the actor model for distributed-memory parallelism [PPPJ'16]. In this project, we propose to extend our past work so as to support a programming model for heterogeneity and resilience based on a unification of asynchronous event-driven tasks and actors. We will also leverage some of our recent experiences with FPGA synthesis from high-level programming languages for spatial architectures [FCCM'19].
The programming tool chain will be implemented using a subset of Python that is suitable for static (ahead-of-time) compilation, which we are developing for our DARPA SDH project.
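
The abstract names task replay and task replication as the two soft-error recovery mechanisms supported in a unified runtime. The following is a minimal Python sketch of the general idea only, not the Habanero-C API or the project's implementation. All names here (`run_with_replay`, `run_with_replication`, `make_flaky_sum`, `SoftErrorDetected`) are hypothetical, the `check` callback stands in for an application-level error detector, and the fault injector simply corrupts the result on chosen call numbers.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class SoftErrorDetected(Exception):
    """Raised when no trustworthy result could be obtained (hypothetical)."""


def run_with_replay(task, check, max_attempts=3):
    """Task replay: re-execute the task until its result passes `check`."""
    for _ in range(max_attempts):
        result = task()
        if check(result):
            return result
    raise SoftErrorDetected(f"no valid result after {max_attempts} attempts")


def run_with_replication(task, replicas=3):
    """Task replication: run copies concurrently and take a majority vote."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task) for _ in range(replicas)]
        results = [f.result() for f in futures]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise SoftErrorDetected("replicas disagree; no majority")
    return value


def make_flaky_sum(data, corrupt_calls):
    """Assumed fault injector: the sum is off by one on the listed call numbers."""
    lock = threading.Lock()
    calls = 0

    def task():
        nonlocal calls
        with lock:  # number each call safely, since replicas run in threads
            call_no = calls
            calls += 1
        total = sum(data)
        return total + 1 if call_no in corrupt_calls else total

    return task


data = [1, 2, 3, 4]
# First execution is corrupted; the replayed execution passes the check.
replayed = run_with_replay(make_flaky_sum(data, {0}), check=lambda r: r == 10)
# One of three replicas is corrupted; the other two outvote it.
replicated = run_with_replication(make_flaky_sum(data, {1}))
print(replayed, replicated)
```

Replay pays for re-execution only when an error is detected but needs a detector, while replication needs no detector but always pays for the extra copies. That trade-off is one reason a runtime might offer both, as the abstract describes.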

Document Details

Document Type
DoD Grant Award
Publication Date
Oct 01, 2019
Source ID
W911NF1910493

Entities

People

  • Vivek Sarkar

Organizations

  • Army Contracting Command
  • Georgia Tech Research Corporation
  • National Security Agency

Tags

Fields of Study

  • Computer science
  • Engineering

Readers

  • Distributed Systems and Data Platform Development
  • Parallel and Distributed Computing
  • Systems Analysis and Design