
Optimizing Depthwise Convolutions on ARMv8 Architecture

  • Conference paper
Parallel and Distributed Computing, Applications and Technologies (PDCAT 2022)

Abstract

Depthwise convolutions are widely used in lightweight convolutional neural networks (CNNs). Unlike classic convolutions, their performance is bounded mainly by memory access rather than arithmetic, so direct algorithms are often more efficient than indirect ones (matrix multiplication-, Winograd-, and FFT-based convolutions), which require additional memory accesses. However, existing direct implementations of depthwise convolutions on ARMv8 architectures make a poor trade-off between the register-level reuse of the different tensors, which usually leads to sub-optimal performance. In this paper, we propose a new direct implementation of depthwise convolutions based on implicit padding, register tiling, and related techniques. Compared with existing implementations, ours incurs far less data movement between registers and cache. Experimental results on two ARMv8 CPUs show that our implementation delivers an average performance improvement of 4.88× over the existing direct implementations in open-source libraries.
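For intuition about the approach described in the abstract, the following is a minimal, scalar C sketch of a direct depthwise convolution (stride 1, NCHW layout) in which zero padding is handled implicitly through bounds checks instead of a zero-padded input copy. The function name, parameter names, and data layout are illustrative assumptions; the scalar inner loops merely stand in for the register-tiled ARMv8 NEON kernels the paper actually proposes.

#include <stddef.h>

/*
 * Minimal direct depthwise convolution, stride 1, NCHW layout.
 * Illustrative sketch only: layout, names, and the scalar loops are
 * assumptions, not the paper's optimized kernel.
 * Padding is "implicit": out-of-range taps are skipped with bounds checks
 * rather than materializing a zero-padded copy of the input.
 */
static void depthwise_conv2d_direct(
    const float *in,   /* input,   C x H  x W                           */
    const float *wt,   /* weights, C x KH x KW (one filter per channel) */
    float *out,        /* output,  C x OH x OW                          */
    int C, int H, int W, int KH, int KW, int pad)
{
    const int OH = H + 2 * pad - KH + 1;
    const int OW = W + 2 * pad - KW + 1;

    for (int c = 0; c < C; ++c) {               /* each channel is independent */
        const float *in_c  = in  + (size_t)c * H  * W;
        const float *wt_c  = wt  + (size_t)c * KH * KW;
        float       *out_c = out + (size_t)c * OH * OW;

        for (int oh = 0; oh < OH; ++oh) {
            for (int ow = 0; ow < OW; ++ow) {
                float acc = 0.0f;
                for (int kh = 0; kh < KH; ++kh) {
                    const int ih = oh - pad + kh;
                    if (ih < 0 || ih >= H) continue;    /* implicit zero padding */
                    for (int kw = 0; kw < KW; ++kw) {
                        const int iw = ow - pad + kw;
                        if (iw < 0 || iw >= W) continue;
                        acc += in_c[ih * W + iw] * wt_c[kh * KW + kw];
                    }
                }
                out_c[oh * OW + ow] = acc;
            }
        }
    }
}

A register-tiled variant would compute a small block of output elements per channel inside the inner loops so that loaded input and weight values are reused directly from registers; reducing that register-cache traffic is the trade-off the paper targets.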

Supported by the National Natural Science Foundation of China under Grant No. 62002365.



Author information

Corresponding author

Correspondence to Qinglin Wang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hao, R. et al. (2023). Optimizing Depthwise Convolutions on ARMv8 Architecture. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_34


  • DOI: https://doi.org/10.1007/978-3-031-29927-8_34


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-29926-1

  • Online ISBN: 978-3-031-29927-8

  • eBook Packages: Computer Science, Computer Science (R0)
