Abstract
Depthwise convolutions are widely used in lightweight convolutional neural networks (CNNs). Unlike classic convolutions, their performance is mainly bounded by memory access rather than arithmetic operations, so direct algorithms are often more efficient than indirect ones (matrix multiplication-, Winograd-, and FFT-based convolutions), which incur additional memory accesses. However, the existing direct implementations of depthwise convolutions on ARMv8 architectures make a poor trade-off between the register-level reuse of the different tensors, which usually leads to sub-optimal performance. In this paper, we propose a new direct implementation of depthwise convolutions based on implicit padding and register tiling. Compared to the existing implementations, ours incurs much less communication overhead between registers and cache. Experimental results on two ARMv8 CPUs show that our implementation delivers an average 4.88× performance improvement over the existing direct implementations in open-source libraries.
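As context for the abstract, the following is a minimal C sketch of the kind of direct depthwise convolution the paper optimizes: a stride-1, 3×3 kernel with implicit zero padding (out-of-range taps are skipped rather than materialized in a padded copy). The function name and NCHW layout are illustrative assumptions, not the paper's actual kernel, which further applies register tiling and NEON vectorization to reduce register–cache traffic.

```c
#include <stdio.h>

/* Direct depthwise convolution, stride 1, 3x3 kernel, zero padding 1.
 * Input/output are NCHW with N = 1; each channel c is convolved with
 * its own 3x3 filter w[c], so no channel mixing occurs -- arithmetic
 * intensity is low and memory access dominates the cost. */
static void depthwise_conv3x3(const float *in, const float *w,
                              float *out, int C, int H, int W) {
    for (int c = 0; c < C; ++c) {
        const float *x = in + c * H * W;   /* channel-c input plane  */
        const float *k = w + c * 9;        /* channel-c 3x3 filter   */
        float *y = out + c * H * W;        /* channel-c output plane */
        for (int h = 0; h < H; ++h) {
            for (int v = 0; v < W; ++v) {
                float acc = 0.0f;
                for (int kh = 0; kh < 3; ++kh) {
                    for (int kw = 0; kw < 3; ++kw) {
                        int ih = h + kh - 1, iw = v + kw - 1;
                        /* implicit zero padding: skip taps that fall
                         * outside the input instead of padding it */
                        if (ih >= 0 && ih < H && iw >= 0 && iw < W)
                            acc += x[ih * W + iw] * k[kh * 3 + kw];
                    }
                }
                y[h * W + v] = acc;
            }
        }
    }
}
```

An optimized kernel would unroll the inner loops over a small output tile so that each loaded input element is reused across several accumulators held in registers, which is the register-tiling trade-off the paper addresses.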
Supported by the National Natural Science Foundation of China under Grant No. 62002365.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hao, R. et al. (2023). Optimizing Depthwise Convolutions on ARMv8 Architecture. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-29926-1
Online ISBN: 978-3-031-29927-8