
Optimizing Massively Parallel Winograd Convolution on ARM Processor

Published: 05 October 2021

Abstract

Convolutional Neural Networks (CNNs) have achieved great success in deep learning applications and have been accelerated by dedicated convolution algorithms. The Winograd-based algorithm can greatly reduce the number of arithmetic operations required in convolution. However, our experiments show that existing implementations in deep learning libraries cannot achieve the expected parallel performance on ARM manycore CPUs with a last-level cache (LLC). Compared to multicore processors, ARM manycore CPUs have more cores and more NUMA nodes, and their parallel performance is more easily restricted by memory bandwidth, cache contention, NUMA configuration, etc. In this paper, we propose an optimized implementation of the single-precision Winograd-based algorithm for ARM manycore CPUs. Our algorithm adjusts the data layout according to the input shape and is optimized for the characteristics of the ARM processor, thus reducing the matrix transformation overhead and achieving high arithmetic intensity. We redesign the parallel algorithm for Winograd-based convolution to achieve a more efficient implementation on manycore CPUs. Experimental results on 32 cores show that, for modern ConvNets, our implementation achieves speedups ranging from 3× to 5× over the state-of-the-art Winograd-based convolution on ARM processors. On a set of convolutional benchmarks executed on a 128-core system with 4 NUMA nodes, our implementation also outperforms existing implementations on ARM processors.
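To illustrate the operation-count reduction the abstract refers to, here is a minimal sketch (our illustration, not the paper's implementation) of the 1D Winograd minimal filtering algorithm F(2,3), which computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by direct convolution:

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap filter
    over a 4-element input tile with 4 multiplications (direct
    convolution needs 6). 2D Winograd convolution nests this transform
    over image tiles."""
    d0, d1, d2, d3 = d  # input tile
    g0, g1, g2 = g      # filter taps
    # In practice the filter-side factors are transformed once per
    # filter and reused across all input tiles.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Output transform combines the four products into the two results.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Nesting this transform in two dimensions, F(2×2, 3×3) produces a 2×2 output tile with 16 multiplications instead of 36, the roughly 2.25× arithmetic reduction typically quoted for Winograd-based 3×3 convolution.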



Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NUMA
  2. convolution
  3. parallelization
  4. winograd

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2024) High-Performance 3D convolution on the Latest Generation Sunway Processor. Proceedings of the 53rd International Conference on Parallel Processing, 241–251. https://doi.org/10.1145/3673038.3673093. Online publication date: 12-Aug-2024.
  • (2024) Im2col-Winograd: An Efficient and Flexible Fused-Winograd Convolution for NHWC Format on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 1072–1081. https://doi.org/10.1145/3673038.3673039. Online publication date: 12-Aug-2024.
  • (2024) YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 212–226. https://doi.org/10.1145/3640537.3641566. Online publication date: 17-Feb-2024.
  • (2024) Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs. ACM Transactions on Architecture and Code Optimization 21, 1, 1–26. https://doi.org/10.1145/3632956. Online publication date: 19-Jan-2024.
  • (2024) Enabling Resource-Efficient AIoT System With Cross-Level Optimization: A Survey. IEEE Communications Surveys & Tutorials 26, 1, 389–427. https://doi.org/10.1109/COMST.2023.3319952. Online publication date: Sep-2025.
  • (2024) Optimization of NUMA Aware DNN Computing System. Advanced Intelligent Computing Technology and Applications, 124–136. https://doi.org/10.1007/978-981-97-5591-2_11. Online publication date: 14-Aug-2024.
  • (2023) Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure. ACM Transactions on Architecture and Code Optimization 20, 3, 1–21. https://doi.org/10.1145/3605149. Online publication date: 19-Jul-2023.
  • (2023) Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU. IEEE Transactions on Parallel and Distributed Systems 34, 7, 2221–2235. https://doi.org/10.1109/TPDS.2023.3280805. Online publication date: 1-Jul-2023.
  • (2023) Accelerating CNN inference on long vector architectures via co-design. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 145–155. https://doi.org/10.1109/IPDPS54959.2023.00024. Online publication date: May-2023.
  • (2023) Optimizing massively parallel sparse matrix computing on ARM many-core processor. Parallel Computing 117, C. https://doi.org/10.1016/j.parco.2023.103035. Online publication date: 1-Sep-2023.
