
Optimizing Pointwise Convolutions on Multi-core DSPs

Conference paper in: Algorithms and Architectures for Parallel Processing (ICA3PP 2023)

Abstract

Pointwise convolutions are widely used in convolutional neural networks due to their low computational complexity and small parameter footprint. Nevertheless, like regular convolutions, they remain time-consuming. As power consumption becomes an increasing concern, low-power embedded processors such as multi-core digital signal processors (DSPs) have entered the high-performance computing field. In this paper, we propose a high-performance, multi-level parallel direct implementation of pointwise convolutions on the multi-core DSPs of FT-M7032, a CPU-DSP heterogeneous prototype processor. The main optimizations include on-chip memory blocking, loop ordering, vectorization, register blocking, and multi-core parallelization. Experimental results show that the proposed direct implementation substantially outperforms GEMM-based implementations on FT-M7032, achieving a speedup of up to 79.26 times.
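For readers unfamiliar with the operation, a pointwise (1x1) convolution reduces to a channel-mixing loop nest: each output channel is a weighted sum of all input channels at every spatial position. The sketch below is an illustrative reference implementation, not the paper's optimized DSP kernel; the names, data layout (channels x flattened spatial positions), and loop order are assumptions. The comments note where the paper's optimizations (loop ordering, vectorization of the innermost spatial loop, register blocking of the weight scalar) would apply.

```python
def pointwise_conv(inp, weight, c_in, c_out, hw):
    """Direct 1x1 convolution.

    inp:    c_in rows, each a list of hw values (flattened H*W positions).
    weight: c_out rows, each a list of c_in values.
    Returns c_out rows of hw values.
    """
    out = [[0.0] * hw for _ in range(c_out)]
    for co in range(c_out):          # output channels (parallelizable across cores)
        for ci in range(c_in):       # input channels
            w = weight[co][ci]       # scalar held in a register across the spatial loop
            row_in = inp[ci]
            row_out = out[co]
            for p in range(hw):      # innermost spatial loop: the vectorization target
                row_out[p] += w * row_in[p]
    return out
```

Note that with the spatial dimensions flattened, this loop nest is exactly a (c_out x c_in) by (c_in x hw) matrix product, which is why GEMM-based implementations are the natural baseline the paper compares against.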

Supported by the National Natural Science Foundation of China under grant nos. 62002365 and 62025208.




Corresponding author: Qinglin Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, Y., Wang, Q., Pei, X., Mei, S., Liu, J. (2024). Optimizing Pointwise Convolutions on Multi-core DSPs. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14493. Springer, Singapore. https://doi.org/10.1007/978-981-97-0862-8_13


  • DOI: https://doi.org/10.1007/978-981-97-0862-8_13

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0861-1

  • Online ISBN: 978-981-97-0862-8

  • eBook Packages: Computer Science; Computer Science (R0)
