PatDNN: Towards 100X Acceleration and Real-Time DNN Execution on Mobile Platforms

Abstract

With the emergence of a spectrum of high-end mobile platforms, many applications that formerly required server-level or desktop-level computation capability are being migrated to these devices. However, DNN (Deep Neural Network) inference on such devices remains challenging because of its high computation and storage demands, especially when real-time performance with high accuracy is needed, and online adaptive learning is harder still. Overcoming this challenge is essential to the ubiquitous sensing, communication, perception, and control tasks that are of interest to the Army. Weight pruning of DNNs has been proposed to reduce computation and storage, and thereby accelerate inference. However, current schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate but not hardware-friendly, whereas structured pruning is coarse-grained and hardware-efficient but incurs a higher accuracy loss. This proposal advances the state of the art by introducing a new dimension, fine-grained pruning patterns inside coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, the proposed method achieves the best of both worlds and is desirable across the theory/algorithm, compiler, and hardware levels.

The proposed PatDNN is an end-to-end framework for accelerating DNN inference and online learning on a spectrum of mobile platforms, including embedded CPUs, embedded GPUs, and FPGAs. PatDNN uniquely enables truly cross-layer vertical integration, from real-life applications down to executable code for mobile platforms and FPGA hardware prototypes. PatDNN consists of innovations at both the algorithm level and the compiler/hardware level. The former includes kernel pattern and connectivity pruning, mobile platform-aware weight/activation quantization, fine-grained structured pruning for fully-connected layers and RNNs, online adaptive learning, and FPGA-specific optimizations, unified under the proposed extended ADMM-based solution framework. At the compiler/hardware level, PatDNN includes an execution code generation framework with multiple optimizations (filter kernel reordering, load redundancy elimination, etc.), along with FPGA-specific optimizations and hardware prototyping. Finally, this proposal presents a neural network and hardware co-design framework with automatic hyperparameter determination based on hardware characteristics and modeling.

The partially optimized, preliminary results achieve an unprecedented 18 ms end-to-end inference time for the large-scale DNN VGG-16 (ImageNet dataset) on the Adreno 540 embedded GPU, without accuracy loss. This corresponds to 16.7X and 56X speedups over TVM and TFLite, respectively, on the same computing device. With the fully optimized PatDNN, the end-to-end inference time is anticipated to be reduced by over 100X compared with TFLite. For the first time, there is the potential to enable real-time execution of most large-scale DNNs on mobile devices, while providing ARO with solution kits for DNN acceleration in the ubiquitous sensing, perception, and control tasks of Army applications.
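The kernel pattern and connectivity pruning named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not stated in this record: kernels are 3x3, each pattern keeps four weights, the four-pattern library below is purely hypothetical, and selection is by simple weight magnitude rather than the extended ADMM-based framework the proposal describes. All function names are illustrative.

```python
# Hypothetical pattern library: each pattern lists the (row, col) positions
# that survive in a 3x3 kernel. These four patterns are illustrative only.
PATTERNS = [
    ((0, 0), (0, 1), (1, 0), (1, 1)),
    ((0, 1), (0, 2), (1, 1), (1, 2)),
    ((1, 0), (1, 1), (2, 0), (2, 1)),
    ((1, 1), (1, 2), (2, 1), (2, 2)),
]


def kept_energy(kernel, pattern):
    """Sum of squared weights that a pattern would keep."""
    return sum(kernel[r][c] ** 2 for r, c in pattern)


def apply_pattern(kernel):
    """Kernel pattern pruning: choose the pattern that keeps the most
    energy and zero every weight outside it.
    Returns (pattern_index, pruned_kernel)."""
    best = max(range(len(PATTERNS)),
               key=lambda i: kept_energy(kernel, PATTERNS[i]))
    keep = set(PATTERNS[best])
    pruned = [[kernel[r][c] if (r, c) in keep else 0.0 for c in range(3)]
              for r in range(3)]
    return best, pruned


def connectivity_prune(kernels, keep_count):
    """Connectivity pruning: drop whole kernels with the smallest energy.
    Removed kernels are returned as None."""
    energy = [sum(w * w for row in k for w in row) for k in kernels]
    ranked = sorted(range(len(kernels)), key=lambda i: energy[i], reverse=True)
    kept = set(ranked[:keep_count])
    return [k if i in kept else None for i, k in enumerate(kernels)]


def prune_layer(kernels, keep_count):
    """Apply connectivity pruning, then pattern pruning to the survivors."""
    return [None if k is None else apply_pattern(k)
            for k in connectivity_prune(kernels, keep_count)]


if __name__ == "__main__":
    strong = [[0.9, 0.8, 0.0], [0.7, 0.6, 0.1], [0.0, 0.1, 0.0]]
    weak = [[0.01] * 3 for _ in range(3)]
    for entry in prune_layer([strong, weak], keep_count=1):
        print(entry)
```

Because every surviving kernel is tagged with a pattern index drawn from a small fixed set, a code generator could in principle group kernels that share a pattern and emit one specialized loop body per pattern. That regularity is what makes the compiler-level optimizations listed in the abstract, such as filter kernel reordering, plausible. The sketch does not attempt them.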

Document Details

Document Type
DoD Grant Award
Publication Date
Jul 09, 2020
Source ID
W911NF2010167

Entities

People

  • Yanzhi Wang

Organizations

  • Army Contracting Command
  • Northeastern University
  • United States Army

Tags

Fields of Study

  • Computer science

Readers

  • Distributed Systems and Data Platform Development
  • Neural Network Machine Learning
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Neural Networks
  • Space