Distributed deep learning training using silicon photonic switched architectures
Abstract
The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle the challenges and accelerate distributed deep learning training. In addition, our proposed system architecture utilizes a highly integrated operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate the feasibility of our proposal, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to 3.6× improvements over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is employed. The collective results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.
Document Details
- Document Type: Pub Defense Publication
- Publication Date: Mar 01, 2022
- Source ID: 10.1063/5.0070711
Entities
People
- Keren Bergman
- Maarten Hattink
- Madeleine Strom Glick
- Min Yee Teh
- Shijia Yan
- Zhenguo Wu
- Ziyi Zhu
Organizations
- Columbia University