OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning

Journal Article · IEEE Transactions on Parallel and Distributed Systems
Author affiliations:
  1. Shenzhen Institutes of Advanced Technology (China). High Performance Computing Center
  2. National Inst. of Advanced Industrial Science and Technology (AIST), Tokyo (Japan). AIST/TokyoTech Open Innovation Lab.
  3. Johannes Gutenberg Univ., Mainz (Germany). High Performance Computing Center
  4. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  5. Shandong Univ., Jinan (China)
  6. Tencent, Shenzhen (China)

In this work, we present FastConv, a template-based, open-source code auto-generation library that produces high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming convolution layers of convolutional neural networks. ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs. This leads to the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple tuned kernel variants suited to tall-and-skinny matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order schedules, packing strategies, access patterns, and online/offline computations. Auto-tuning is used to search this parameter configuration space for the best performance on a given target architecture and problem size. Layer-wise experiments on the VGG-16 model confirm a 1.25x performance gain from tuning the Winograd library. Integrated comparisons show speedups of 1.02x to 1.40x over NNPACK, 1.14x to 2.17x over Arm NN, and 1.22x to 2.48x over FeatherCNN on the Kunpeng 920, apart from a few cases. Furthermore, problem-size performance-portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over the Winograd implementations of NNPACK and the Arm NN inference engine on the Kunpeng 920. CPU performance-portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on the Kunpeng 920, Snapdragon 835, 855, and 888, Apple M1, and AWS Graviton2, respectively.
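For context, the Winograd algorithm named in the abstract reduces the number of multiplications in a convolution at the cost of extra additions. Below is a minimal sketch of the 1-D F(2,3) case (two outputs from a three-tap filter using four multiplies instead of six); FastConv's actual kernels are template-generated, tiled, and vectorized for ARM, so this is illustrative only:

```cpp
#include <array>
#include <cstdio>

// Direct 1-D convolution: 2 outputs from 4 inputs and a 3-tap filter (6 multiplies).
std::array<float, 2> direct_f23(const std::array<float, 4>& d,
                                const std::array<float, 3>& g) {
    return { d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
             d[1]*g[0] + d[2]*g[1] + d[3]*g[2] };
}

// Winograd F(2,3): the same 2 outputs with only 4 multiplies. The filter-side
// terms (scaled by 0.5) depend only on the weights, so they can be transformed
// offline -- one instance of the online/offline trade-off the abstract mentions.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    float m1 = (d[0] - d[2]) * g[0];
    float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    float m4 = (d[1] - d[3]) * g[2];
    return { m1 + m2 + m3, m2 - m3 - m4 };
}

int main() {
    std::array<float, 4> d = {1.f, 2.f, 3.f, 4.f};
    std::array<float, 3> g = {0.5f, -1.f, 2.f};
    auto a = direct_f23(d, g), b = winograd_f23(d, g);
    std::printf("direct: %g %g | winograd: %g %g\n", a[0], a[1], b[0], b[1]);
    return 0;
}
```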
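The auto-tuning step described in the abstract can be pictured as an empirical search over generated kernel variants. The sketch below is hypothetical, not FastConv's API: TileConfig and run_kernel are invented stand-ins, and a real tuner would also vary loop orders, packing strategies, access patterns, and kernel shapes:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

struct TileConfig { int mc, nc; };  // candidate cache-tile sizes (illustrative)

// Stand-in for one generated kernel variant; a real tuner would time the
// actual convolution kernel on the target problem size.
static double run_kernel(const TileConfig& c) {
    volatile double acc = 0.0;
    for (int i = 0; i < c.mc * c.nc; ++i) acc += i * 1e-9;
    return acc;
}

int main() {
    std::vector<TileConfig> space = {{64, 64}, {128, 64}, {128, 128}, {256, 64}};
    TileConfig best{};
    double best_t = 1e30;
    for (const auto& c : space) {
        auto t0 = std::chrono::steady_clock::now();
        run_kernel(c);  // in practice: repeat and take the median time
        double t = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        if (t < best_t) { best_t = t; best = c; }
    }
    std::printf("best tile: %d x %d (%.6f s)\n", best.mc, best.nc, best_t);
    return 0;
}
```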

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; National Key Research and Development Program of China; National Natural Science Foundation of China (NSFC)
Grant/Contract Number:
AC05-00OR22725; JPMJPR20MA; JP21K17750; U1813203; 2018YFB0204403
OSTI ID:
1863284
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 5, Issue 1; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English

References (26)

LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation
  • Heinecke, Alexander; Henry, Greg; Hutchinson, Maxwell
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1109/SC.2016.83
conference November 2016
Modelling the ARMv8 architecture, operationally: concurrency and ISA
  • Flur, Shaked; Gray, Kathryn E.; Pulte, Christopher
  • POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. https://doi.org/10.1145/2837614.2837615
conference January 2016
ARMv8-A next-generation vector architecture for HPC conference August 2016
Batched matrix computations on hardware accelerators based on GPUs journal April 2014
LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores
  • Yang, Weiling; Fang, Jianbin; Dong, Dezun
  • SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1145/3458817.3476217
conference November 2021
Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs
  • Kim, Sumin; Oh, Seunghwan; Yi, Youngmin
  • HotMobile '21: Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications. https://doi.org/10.1145/3446382.3448606
conference February 2021
Cache-aware Roofline model: Upgrading the loft journal January 2014
Roofline: an insightful visual performance model for multicore architectures journal April 2009
Anatomy of High-Performance Many-Threaded Matrix Multiplication
  • Smith, Tyler M.; Geijn, Robert van de; Smelyanskiy, Mikhail
  • IPDPS 2014: 2014 IEEE 28th International Parallel and Distributed Processing Symposium. https://doi.org/10.1109/IPDPS.2014.110
conference May 2014
IoT security: Review, blockchain solutions, and open challenges journal May 2018
Internet of Things (IoT): A vision, architectural elements, and future directions journal September 2013
FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures journal March 2020
Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs conference February 2019
Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels journal January 2020
Fast Optimisation of Convolutional Neural Network Inference using System Performance Models
  • Mulder, Rik; Radu, Valentin; Dubach, Christophe
  • Proceedings of the 1st Workshop on Machine Learning and Systems (EuroSys '21). https://doi.org/10.1145/3437984.3458840
conference April 2021
Optimizing N-dimensional, winograd-based convolution for manycore CPUs
  • Jia, Zhen; Zlateski, Aleksandar; Durand, Fredo
  • PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. https://doi.org/10.1145/3178487.3178496
conference February 2018
Optimizing batched winograd convolution on GPUs
  • Yan, Da; Wang, Wei; Chu, Xiaowen
  • PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. https://doi.org/10.1145/3332466.3374520
conference February 2020
Anatomy of high-performance matrix multiplication journal May 2008
High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps conference August 2015
Fd-Mobilenet: Improved Mobilenet with a Fast Downsampling Strategy conference October 2018
Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures
  • Georganas, Evangelos; Avancha, Sasikanth; Banerjee, Kunal
  • SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1109/SC.2018.00069
conference November 2018
Fast Algorithms for Convolutional Neural Networks conference June 2016
Optimizing Deep Learning Workloads on ARM GPU with TVM
  • Zheng, Lanmin; Chen, Tianqi
  • ReQuEST '18: Proceedings of the 1st Reproducible Quality-Efficient Systems Tournament on Co-designing Pareto-efficient Deep Learning. https://doi.org/10.1145/3229762.3229764
conference June 2018
I/O lower bounds for auto-tuning of convolutions in CNNs
  • Zhang, Xiaoyang; Xiao, Junmin; Tan, Guangming
  • PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. https://doi.org/10.1145/3437801.3441609
conference February 2021
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines journal June 2013
A Memory-aware Performance Optimization of Tensor Programs for Embedded Devices conference November 2020