Programmable Accelerator Networks

Abstract

In response to the inherent inefficiencies in modern HPC interconnects, we seek to define a new, hybrid approach that combines software-defined networking (SDN) concepts with high-performance, low-latency network interconnect technologies and embedded, full-featured compute accelerators. This paradigm, herein referred to as the Programmable Accelerator Network or PAN, is designed to provide computational capabilities in situ within the network infrastructure, reducing the time and associated energy required to move all data elements from distant compute nodes into a single node for a target algorithmic construct. By computing within the network, PAN is designed to significantly (1) improve bandwidth utilization, (2) reduce total system energy consumption, and (3) improve the overall throughput and efficiency of the total system architecture. Unlike a traditional high-performance network interconnect, the PAN interconnect infrastructure will support messaging facilities that carry both data and binary code snippets. In this manner, we seek to support both traditional network operations (data transfers, collectives, RDMA, etc.) and operations that compute in the network, which allows the entire network to be used as a target accelerator within a parallel application. As a result, the PAN infrastructure will be able to accelerate parallel applications that are traditionally bound by low compute efficiency and small messages by combining these operations into a single solution. This method will benefit traditionally sparse workloads such as large-scale sparse solvers and graph algorithms.

The difficulty in realizing a fully programmable, high-performance network infrastructure is threefold. First, we must develop simple and concise extensions to standard parallel programming models (MPI, OpenSHMEM, etc.) that abstract the complexities of network topology, routing, and congestion and that provide common accelerator interfaces to the network. Second, we must develop the methodologies, tools, and libraries required to perform traditional system-level functions with PAN. Finally, we must develop the hardware infrastructure for PAN so that it delivers better system energy efficiency and performance than current high-performance interconnects.

The proposed effort will require three phases of research and development. The first phase will include initial research into the data motion and SDN protocol overlays necessary to operate a programmable accelerator network, as well as the development of an integrated simulation infrastructure based upon the existing Sandia Structural Simulation Toolkit (SST). The second phase will develop the programming model concepts needed to utilize the programmable network resources. This will include MPI and OpenSHMEM programming model abstractions that utilize the PAN infrastructure, as well as candidate benchmark applications executing on the simulator developed in Phase 1. The final phase will develop a hardware prototype of the design using appropriate FPGA boards capable of high-performance networking, together with extended RISC-V cores for the in-situ PAN compute capabilities. It will culminate in a hardware prototype demonstration using the benchmark applications developed in Phase 2.
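
As a rough illustration of the data-movement argument in the abstract, the following toy model compares gathering every node's data at a single node against combining partial results inside the network. It is a sketch under stated assumptions, not the PAN design: the binary-tree topology, the function names, and the cost metric (vector elements crossing a link) are all hypothetical and chosen only for this example.

```python
# Toy model only: the binary-tree topology, the function names, and the cost
# metric are illustrative assumptions, not part of the PAN design.


def _tree_depth(num_leaves):
    """Return the depth of a full binary tree with `num_leaves` leaves."""
    if num_leaves < 1 or num_leaves & (num_leaves - 1):
        raise ValueError("number of leaves must be a power of two")
    return num_leaves.bit_length() - 1


def gather_then_reduce(leaf_vectors):
    """Ship every leaf vector to the root unchanged, then sum there."""
    depth = _tree_depth(len(leaf_vectors))
    width = len(leaf_vectors[0])
    # Each vector crosses `depth` links on its way to the root.
    elements_moved = len(leaf_vectors) * depth * width
    result = [sum(column) for column in zip(*leaf_vectors)]
    return result, elements_moved


def reduce_in_network(leaf_vectors):
    """Sum pairs of vectors at each switch level on the way to the root."""
    _tree_depth(len(leaf_vectors))  # validates the leaf count
    level = [list(v) for v in leaf_vectors]
    elements_moved = 0
    while len(level) > 1:
        # Every vector at this level crosses one link to its parent switch.
        elements_moved += len(level) * len(level[0])
        # The (hypothetical) switch combines its two children element-wise.
        level = [
            [a + b for a, b in zip(level[i], level[i + 1])]
            for i in range(0, len(level), 2)
        ]
    return level[0], elements_moved


if __name__ == "__main__":
    vectors = [[i + j for j in range(4)] for i in range(8)]
    print(gather_then_reduce(vectors))
    print(reduce_in_network(vectors))
```

For eight leaves holding four-element vectors, both functions return the same sum, but the gather approach moves 96 elements across links while the in-network approach moves 56. The gap widens as the tree grows, which is the effect the abstract appeals to when it targets bandwidth utilization and energy.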

Document Details

Document Type
DoD Grant Award
Publication Date
Jan 06, 2020
Source ID
W911NF2010003

Entities

People

  • John Leidel

Organizations

  • Army Contracting Command
  • National Security Agency

Tags

Fields of Study

  • Computer science

Readers

  • Computer Networking
  • Distributed Systems and Data Platform Development
  • Parallel and Distributed Computing