
Optimizing Massively Parallel Winograd Convolution on ARM Processor

Published: 05 October 2021

Abstract

Convolutional Neural Networks (CNNs) have achieved great success in deep learning applications and have been accelerated by dedicated convolution algorithms. The Winograd-based algorithm can greatly reduce the number of arithmetic operations required in convolution. However, our experiments show that existing implementations in deep learning libraries cannot achieve the expected parallel performance on ARM manycore CPUs with a last-level cache (LLC). Compared to multicore processors, ARM manycore CPUs have more cores and more NUMA nodes, and their parallel performance is more easily restricted by memory bandwidth, cache contention, NUMA configuration, etc. In this paper, we propose an optimized implementation of the single-precision Winograd-based algorithm for ARM manycore CPUs. Our algorithm adjusts the data layout according to the input shape and is optimized for the characteristics of the ARM processor, thus reducing the matrix transformation overhead and achieving high arithmetic intensity. We redesign the parallel algorithm for Winograd-based convolution to achieve a more efficient implementation on manycore CPUs. Experimental results on 32 cores show that, for modern ConvNets, our implementation achieves speedups ranging from 3× to 5× over the state-of-the-art Winograd-based convolution on ARM processors. On a set of convolutional benchmarks executed on a 128-core system with 4 NUMA nodes, our implementation also outperforms existing implementations on ARM processors.
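To illustrate the operation-count reduction the abstract refers to, here is a minimal sketch (our illustration, not the paper's implementation) of the 1D Winograd minimal filtering algorithm F(2,3), which computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by direct convolution:

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap filter
    over a 4-element input tile with 4 multiplications (direct
    convolution needs 6). 2D Winograd convolution nests this transform
    over image tiles."""
    d0, d1, d2, d3 = d  # input tile
    g0, g1, g2 = g      # filter taps
    # In practice the filter-side factors are transformed once per
    # filter and reused across all input tiles.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Output transform combines the four products into the two results.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Nesting this transform in two dimensions, F(2×2, 3×3) produces a 2×2 output tile with 16 multiplications instead of 36, the roughly 2.25× arithmetic reduction typically quoted for Winograd-based 3×3 convolution.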



Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NUMA
  2. convolution
  3. parallelization
  4. winograd

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2024) High-Performance 3D convolution on the Latest Generation Sunway Processor. Proceedings of the 53rd International Conference on Parallel Processing, 241–251. https://doi.org/10.1145/3673038.3673093. Online publication date: 12-Aug-2024.
  • (2024) Im2col-Winograd: An Efficient and Flexible Fused-Winograd Convolution for NHWC Format on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 1072–1081. https://doi.org/10.1145/3673038.3673039. Online publication date: 12-Aug-2024.
  • (2024) YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 212–226. https://doi.org/10.1145/3640537.3641566. Online publication date: 17-Feb-2024.
  • (2024) Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs. ACM Transactions on Architecture and Code Optimization 21, 1, 1–26. https://doi.org/10.1145/3632956. Online publication date: 19-Jan-2024.
  • (2024) Enabling Resource-Efficient AIoT System With Cross-Level Optimization: A Survey. IEEE Communications Surveys & Tutorials 26, 1, 389–427. https://doi.org/10.1109/COMST.2023.3319952. Online publication date: Sep-2025.
  • (2024) Optimization of NUMA Aware DNN Computing System. Advanced Intelligent Computing Technology and Applications, 124–136. https://doi.org/10.1007/978-981-97-5591-2_11. Online publication date: 14-Aug-2024.
  • (2023) Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure. ACM Transactions on Architecture and Code Optimization 20, 3, 1–21. https://doi.org/10.1145/3605149. Online publication date: 19-Jul-2023.
  • (2023) Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU. IEEE Transactions on Parallel and Distributed Systems 34, 7, 2221–2235. https://doi.org/10.1109/TPDS.2023.3280805. Online publication date: 1-Jul-2023.
  • (2023) Accelerating CNN inference on long vector architectures via co-design. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 145–155. https://doi.org/10.1109/IPDPS54959.2023.00024. Online publication date: May-2023.
  • (2023) Optimizing massively parallel sparse matrix computing on ARM many-core processor. Parallel Computing 117, C. https://doi.org/10.1016/j.parco.2023.103035. Online publication date: 1-Sep-2023.
