
Novel accelerated methods for convolution neural network with matrix core

The Journal of Supercomputing

Abstract

The powerful parallel computing capability of GPUs and the recent development of matrix processing units provide new opportunities to improve the performance of convolutional neural networks (CNNs) on GPUs. For the Winograd convolution algorithm, which is the most widely used and best performing convolution algorithm in CNNs, several tuned implementations already exist, but they ignore the matrix operation units and therefore fail to make full use of the GPU's computing resources. This paper introduces a single-precision acceleration solution for CNNs on GPU. Guided by the architectural characteristics of the hardware, optimal data layout, grid partitioning, and block partitioning schemes are derived. To handle the variety of padding configurations that arise in practice, an efficient dynamic padding scheme is designed, and a pipelined algorithm with operator fusion is implemented on top of the matrix cores. AMD's deep learning acceleration library MIOpen is used as the baseline. Taking several convolutional layers of ResNet50 as experimental input, the evaluation shows that our approach outperforms MIOpen with a speedup of 1.41x on the MI210 and reaches 74% of peak single-precision throughput. Applying the method to the training and inference of ResNet50 yields a speedup of 1.68x.
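The Winograd minimal-filtering idea referred to above can be illustrated with the simplest 1-D case, F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of the six a direct computation needs; the 2-D F(2x2, 3x3) variant used for CNN layers applies the same input, filter and output transforms to 4x4 tiles. The following is a minimal CPU-side C++ sketch of that arithmetic, given only as an illustration of the technique: it is not the authors' GPU implementation, and the function names (winograd_f2k3, direct_f2k3) are invented here for the example.

#include <array>
#include <cassert>
#include <cmath>
#include <cstdio>

// Winograd F(2,3): 2 outputs of a 3-tap filter from 4 inputs,
// using 4 multiplications instead of the 6 a direct computation needs.
std::array<float, 2> winograd_f2k3(const std::array<float, 4>& d,
                                   const std::array<float, 3>& g) {
    // Filter transform (G g): can be precomputed once per filter.
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // Input transform (B^T d).
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // Element-wise (Hadamard) product: the only multiplication stage.
    const float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    // Output transform (A^T m).
    return {m0 + m1 + m2, m1 - m2 - m3};
}

// Direct 3-tap valid convolution (correlation form) for reference.
std::array<float, 2> direct_f2k3(const std::array<float, 4>& d,
                                 const std::array<float, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

int main() {
    const std::array<float, 4> d{1.0f, 2.0f, -3.0f, 0.5f};
    const std::array<float, 3> g{0.25f, -1.0f, 2.0f};
    const auto w = winograd_f2k3(d, g);
    const auto r = direct_f2k3(d, g);
    for (int i = 0; i < 2; ++i) {
        assert(std::fabs(w[i] - r[i]) < 1e-5f);
        std::printf("y[%d] = %f (direct %f)\n", i, w[i], r[i]);
    }
    return 0;
}

In the full 2-D algorithm the element-wise (Hadamard) stage is summed over input channels, which turns it into many small matrix multiplications; this is the part of Winograd convolution that matrix units such as the MI210's matrix cores are typically used to accelerate, and it is the stage around which a fused pipeline like the one described in the abstract is built.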


Availability of data and materials

Data and materials sharing is not applicable to this article, as no datasets were generated or analyzed during the current study; source data are provided with the paper in Figs. 4, 5, 6, 7, 8, 9 and 10.


Funding

This work was supported by the second batch of cultivation projects of Pazhou Laboratory in 2022 (No. PZL2022KF0008) and by the Major Key Project of PCL.

Author information


Contributions

Yijie Guo and Songxiang Zhu conducted the research and wrote the main manuscript text under the guidance of Lu Lu. All authors reviewed the manuscript.

Corresponding author

Correspondence to Lu Lu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethics approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, Y., Lu, L. & Zhu, S. Novel accelerated methods for convolution neural network with matrix core. J Supercomput 79, 19547–19573 (2023). https://doi.org/10.1007/s11227-023-05399-6

