Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning
- Shenzhen Institutes of Advanced Technology (China). High Performance Computing Center
- National Inst. of Advanced Industrial Science and Technology (AIST), Tokyo (Japan). AIST/TokyoTech Open Innovation Lab.
- Johannes Gutenberg Univ., Mainz (Germany). High Performance Computing Center
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Shandong Univ., Jinan (China)
- Tencent, Shenzhen (China)
In this work, we present FastConv, a template-based, open-source code auto-generation library that automatically generates high-performance deep learning convolution kernels for matrices/tensors of arbitrary shapes. FastConv is based on the Winograd algorithm, reportedly the highest-performing algorithm for the time-consuming convolution layers of convolutional neural networks. ARM CPUs span a wide range of designs and specifications, from embedded devices to HPC-grade CPUs, which raises the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple tuned kernel variants suited to tall-and-skinny matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order scheduling, packing strategies, access patterns, and online/offline computations. Auto-tuning is used to search this parameter configuration space for the best performance on a given target architecture and problem size. Layer-wise evaluation on the VGG-16 model confirms a 1.25x performance gain from tuning the Winograd library. Integrated comparison results show speedups of 1.02x to 1.40x over NNPACK, 1.14x to 2.17x over Arm NN, and 1.22x to 2.48x over FeatherCNN on the Kunpeng 920, with only a few exceptions. Furthermore, problem-size performance-portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x speedup over NNPACK and 2x to 22x speedup over the Arm NN inference engine using Winograd on the Kunpeng 920. CPU performance-portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on the Kunpeng 920, Snapdragon 835, 855, and 888, Apple M1, and AWS Graviton2, respectively.
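The record itself contains no formulas; as hedged background, the minimal-filtering identity underlying Winograd-based solvers such as this one (standard in the literature cited below, e.g. "Fast Algorithms for Convolutional Neural Networks", and not quoted from this paper) is, for the 1-D case $F(2,3)$ with a 3-tap filter $g$ and a 4-element input tile $d$:

$$Y = A^{T}\big[(Gg)\odot(B^{T}d)\big],\qquad
B^{T}=\begin{bmatrix}1&0&-1&0\\0&1&1&0\\0&-1&1&0\\0&1&0&-1\end{bmatrix},\quad
G=\begin{bmatrix}1&0&0\\ \tfrac12&\tfrac12&\tfrac12\\ \tfrac12&-\tfrac12&\tfrac12\\0&0&1\end{bmatrix},\quad
A^{T}=\begin{bmatrix}1&1&1&0\\0&1&-1&-1\end{bmatrix}.$$

This produces two outputs with 4 multiplications instead of the 6 required by direct convolution; nesting the identity gives $F(2\times2,3\times3)$, which uses 16 multiplications per tile instead of 36 and is the arithmetic saving behind the speedups reported above.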
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE; National Key Research and Development Program of China; National Natural Science Foundation of China (NSFC)
- Grant/Contract Number:
- AC05-00OR22725; JPMJPR20MA; JP21K17750; U1813203; 2018YFB0204403
- OSTI ID:
- 1863284
- Journal Information:
- IEEE Transactions on Parallel and Distributed Systems, Vol. 5, Issue 1; ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- References:
- LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation (conference, November 2016)
- Modelling the ARMv8 architecture, operationally: concurrency and ISA (conference, January 2016)
- ARMv8-A next-generation vector architecture for HPC (conference, August 2016)
- Batched matrix computations on hardware accelerators based on GPUs (journal, April 2014)
- LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores (conference, November 2021)
- Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs (conference, February 2021)
- Cache-aware Roofline model: Upgrading the loft (journal, January 2014)
- Roofline: an insightful visual performance model for multicore architectures (journal, April 2009)
- Anatomy of High-Performance Many-Threaded Matrix Multiplication (conference, May 2014)
- IoT security: Review, blockchain solutions, and open challenges (journal, May 2018)
- Internet of Things (IoT): A vision, architectural elements, and future directions (journal, September 2013)
- FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures (journal, March 2020)
- Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs (conference, February 2019)
- Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels (journal, January 2020)
- Fast Optimisation of Convolutional Neural Network Inference using System Performance Models (conference, April 2021)
- Optimizing N-dimensional, winograd-based convolution for manycore CPUs (conference, February 2018)
- Optimizing batched winograd convolution on GPUs (conference, February 2020)
- Anatomy of high-performance matrix multiplication (journal, May 2008)
- High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps (conference, August 2015)
- Fd-Mobilenet: Improved Mobilenet with a Fast Downsampling Strategy (conference, October 2018)
- Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures (conference, November 2018)
- Fast Algorithms for Convolutional Neural Networks (conference, June 2016)
- Optimizing Deep Learning Workloads on ARM GPU with TVM (conference, June 2018)
- I/O lower bounds for auto-tuning of convolutions in CNNs (conference, February 2021)
- Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines (journal, June 2013)
- A Memory-aware Performance Optimization of Tensor Programs for Embedded Devices (conference, November 2020)