A batched GEMM optimization framework for deep learning

Abstract

General matrix multiplication (GEMM) is one of the most widely used algorithms in fields such as deep learning, astrophysics, signal processing, and advanced physical analysis. It plays a particularly important role in deep learning, especially for convolutional neural networks, because many of the computations involved are converted into matrix multiplications so that they can exploit the parallel processing power of GPUs. However, the converted matrices are generally too small to fully occupy the GPU. In this paper, we focus on the impact of GEMM on deep learning and propose a framework that computes a batch of GEMMs in a single kernel function in order to increase GPU occupancy. A suite of tiling strategies is designed for batches of matrices with small dimensions and variable sizes. The tiling strategy for each GEMM is selected by considering its kernel occupancy, so that the framework fits different matrix sizes and GPU architectures. GoogLeNet is then implemented with MIOpen as a representative case, and the batched GEMM framework is integrated into it. The experimental results show that, compared with MAGMA, the elapsed time of GoogLeNet optimized with our framework achieves 2.60× and 2.79× speedups on AMD Radeon Instinct MI50 and MI100 GPUs, respectively.
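To make the batching idea concrete, the following is a minimal sketch (in CUDA-style syntax; HIP on the AMD GPUs targeted here is source-compatible for this pattern) of how a batch of small, variable-size GEMMs can be computed in a single kernel launch: each z-slice of the grid is assigned one GEMM of the batch, and that GEMM is tiled through shared memory. The GemmDesc structure, the fixed TILE size, and all identifiers are illustrative assumptions rather than the authors' framework, which additionally selects the tile shape per GEMM from an estimated kernel occupancy.

    #include <cuda_runtime.h>

    constexpr int TILE = 16;   // tile edge; the real framework would pick this per GEMM

    struct GemmDesc {          // per-GEMM problem description (variable sizes)
        const float *A;        // M x K, row-major
        const float *B;        // K x N, row-major
        float *C;              // M x N, row-major
        int M, N, K;
    };

    // One kernel launch covers the whole batch: grid.z indexes the GEMM,
    // grid.x/grid.y index the output tile inside that GEMM.
    __global__ void batchedGemmKernel(const GemmDesc *descs) {
        const GemmDesc d = descs[blockIdx.z];
        // Blocks that fall outside this (smaller) matrix have nothing to do.
        if (blockIdx.y * TILE >= d.M || blockIdx.x * TILE >= d.N) return;

        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;   // row of C owned by this thread
        int col = blockIdx.x * TILE + threadIdx.x;   // column of C owned by this thread
        float acc = 0.0f;

        for (int k0 = 0; k0 < d.K; k0 += TILE) {
            // Stage one tile of A and one tile of B in shared memory, guarding the edges.
            As[threadIdx.y][threadIdx.x] = (row < d.M && k0 + threadIdx.x < d.K)
                ? d.A[row * d.K + k0 + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (k0 + threadIdx.y < d.K && col < d.N)
                ? d.B[(k0 + threadIdx.y) * d.N + col] : 0.0f;
            __syncthreads();

            for (int k = 0; k < TILE; ++k)           // multiply the staged tiles
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < d.M && col < d.N)
            d.C[row * d.N + col] = acc;
    }

    // Host-side launch sketch: size the grid for the largest GEMM in the batch.
    // dim3 block(TILE, TILE);
    // dim3 grid((maxN + TILE - 1) / TILE, (maxM + TILE - 1) / TILE, batchCount);
    // batchedGemmKernel<<<grid, block>>>(d_descs);  // d_descs: GemmDesc array in device memory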


References

  1. Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: International Conference on High Performance Computing, Springer, pp 21–38

  2. Abdelfattah A, Haidar A, Tomov S, Dongarra J (2017) Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. In: Proceedings of the International Conference on Supercomputing, pp 1–10

  3. AMD (2021a) Introducing AMD CDNA architecture: the all-new AMD GPU architecture for the modern era of HPC & AI. https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf

  4. AMD (2021b) Introducing RDNA architecture: the all-new Radeon gaming architecture powering “Navi”. https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

  5. AMD (2021c) rocBLAS Documentation. https://rocblas.readthedocs.io/_/downloads/en/rocm-4.5.2/pdf/

  6. AMD (2021d) ROCm Documentation. https://rocmdocs.amd.com/_/downloads/en/latest/pdf/

  7. Bao W, Chang LW, Chen Y, Deng K, Agarwal A, Barsoum E, Taha A (2019) NGEMM: optimizing GEMM for deep learning via compiler-based techniques. arXiv preprint arXiv:1910.00178

  8. Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: Tenth international workshop on frontiers in handwriting recognition, Suvisoft

  9. Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759

  10. Intel (2021) Intel oneAPI Programming Guide. https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-programming-guide.pdf

  11. Jia Y (2014) Learning semantic image representations at a large scale. University of California, Berkeley

  12. Khan J, Fultz P, Tamazov A, Lowell D, Liu C, Melesse M, Nandhimandalam M, Nasyrov K, Perminov I, Shah T, Filippov V, Zhang J, Zhou J, Natarajan B, Daga M (2019) MIOpen: an open source library for deep learning primitives. arXiv:1910.00078

  13. Kim R, Choi J, Lee M (2019) Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp 101–110

  14. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105

  15. Kurzak J, Tomov S, Dongarra J (2012) Autotuning GEMM kernels for the Fermi GPU. IEEE Trans Parallel Distrib Syst 23(11):2045–2057

  16. Lai J, Seznec A (2013) Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE, pp 1–10

  17. Li X, Zhang G, Huang HH, Wang Z, Zheng W (2016) Performance analysis of GPU-based convolutional neural networks. In: 2016 45th International Conference on Parallel Processing (ICPP), IEEE, pp 67–76

  18. Li X, Liang Y, Yan S, Jia L, Li Y (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. In: Proceedings of the 24th symposium on principles and practice of parallel programming, pp 229–241

  19. Lym S, Lee D, O’Connor M, Chatterjee N, Erez M (2019) DeLTA: GPU performance model for deep learning applications with in-depth memory system traffic analysis. In: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, pp 293–303

  20. Nath R, Tomov S, Dongarra J (2010) An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511–515

  21. NVIDIA (2018) CUTLASS: Fast Linear Algebra in CUDA C++. https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/

  22. NVIDIA (2021a) cuBLAS. https://docs.nvidia.com/cuda/cublas/

  23. NVIDIA (2021b) CUDA Occupancy Calculator. https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html

  24. van Oostrum R, Chalmers N, et al (2019) AMD GPU Hardware Basics. https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf

  25. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV) 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y

  26. Shi S, Wang Q, Xu P, Chu X (2016) Benchmarking state-of-the-art deep learning software tools. In: 2016 7th International Conference on Cloud Computing and Big Data (CCBD), IEEE, pp 99–104

  27. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  28. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9

  29. Tan G, Li L, Triechle S, Phillips E, Bao Y, Sun N (2011) Fast implementation of DGEMM on Fermi GPU. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–11

  30. Vasudevan A, Anderson A, Gregg D (2017) Parallel multi channel convolution using general matrix multiplication. In: 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), IEEE, pp 19–24

  31. Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1492–1500

  32. Yan D, Wang W, Chu X (2020) Optimizing batched Winograd convolution on GPUs. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 32–44

Acknowledgements

This work was supported by the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China (No. 210602103890051) and by the Major Project on the Integration of Industry, Education and Research of Zhongshan (No. 210610173898370).

Author information

Corresponding author

Correspondence to Lu Lu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, Z., Lu, L. & Wang, R. A batched GEMM optimization framework for deep learning. J Supercomput 78, 13393–13408 (2022). https://doi.org/10.1007/s11227-022-04336-3
