
Optimizing Depthwise Convolutions on ARMv8 Architecture

  • Conference paper
Parallel and Distributed Computing, Applications and Technologies (PDCAT 2022)

Abstract

Depthwise convolutions are widely used in lightweight convolutional neural networks (CNNs). Unlike classic convolutions, their performance is bounded mainly by memory access rather than arithmetic, so direct algorithms are often more efficient than indirect ones (matrix multiplication-, Winograd-, and FFT-based convolutions), which require additional memory accesses. However, existing direct implementations of depthwise convolutions on ARMv8 architectures make a poor trade-off between the register-level reuse of the different tensors, which usually leads to sub-optimal performance. In this paper, we propose a new direct implementation of depthwise convolutions based on implicit padding, register tiling, and related techniques. Compared with existing implementations, ours incurs far less data movement between registers and cache. Experimental results on two ARMv8 CPUs show that our implementation delivers an average performance improvement of 4.88× over the existing direct implementations in open-source libraries.
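For intuition about the approach described in the abstract, the following is a minimal, scalar C sketch of a direct depthwise convolution (stride 1, NCHW layout) in which zero padding is handled implicitly through bounds checks instead of a zero-padded input copy. The function name, parameter names, and data layout are illustrative assumptions; the scalar inner loops merely stand in for the register-tiled ARMv8 NEON kernels the paper actually proposes.

#include <stddef.h>

/*
 * Minimal direct depthwise convolution, stride 1, NCHW layout.
 * Illustrative sketch only: layout, names, and the scalar loops are
 * assumptions, not the paper's optimized kernel.
 * Padding is "implicit": out-of-range taps are skipped with bounds checks
 * rather than materializing a zero-padded copy of the input.
 */
static void depthwise_conv2d_direct(
    const float *in,   /* input,   C x H  x W                           */
    const float *wt,   /* weights, C x KH x KW (one filter per channel) */
    float *out,        /* output,  C x OH x OW                          */
    int C, int H, int W, int KH, int KW, int pad)
{
    const int OH = H + 2 * pad - KH + 1;
    const int OW = W + 2 * pad - KW + 1;

    for (int c = 0; c < C; ++c) {               /* each channel is independent */
        const float *in_c  = in  + (size_t)c * H  * W;
        const float *wt_c  = wt  + (size_t)c * KH * KW;
        float       *out_c = out + (size_t)c * OH * OW;

        for (int oh = 0; oh < OH; ++oh) {
            for (int ow = 0; ow < OW; ++ow) {
                float acc = 0.0f;
                for (int kh = 0; kh < KH; ++kh) {
                    const int ih = oh - pad + kh;
                    if (ih < 0 || ih >= H) continue;    /* implicit zero padding */
                    for (int kw = 0; kw < KW; ++kw) {
                        const int iw = ow - pad + kw;
                        if (iw < 0 || iw >= W) continue;
                        acc += in_c[ih * W + iw] * wt_c[kh * KW + kw];
                    }
                }
                out_c[oh * OW + ow] = acc;
            }
        }
    }
}

A register-tiled variant would compute a small block of output elements per channel inside the inner loops so that loaded input and weight values are reused directly from registers; reducing that register-cache traffic is the trade-off the paper targets.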

Supported by the National Natural Science Foundation of China under Grant No. 62002365.



Author information

Corresponding author

Correspondence to Qinglin Wang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hao, R. et al. (2023). Optimizing Depthwise Convolutions on ARMv8 Architecture. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_34


  • DOI: https://doi.org/10.1007/978-3-031-29927-8_34


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-29926-1

  • Online ISBN: 978-3-031-29927-8

  • eBook Packages: Computer Science, Computer Science (R0)
