Abstract
General matrix multiplication (GEMM) is one of the most widely used algorithms in fields such as deep learning, astrophysics, signal processing, and advanced physical analysis. It plays a particularly important role in deep learning, especially for convolutional neural networks, because many of the computations involved are converted into matrix multiplications so that the parallel processing power of GPUs can be exploited. However, the matrices produced by this conversion are generally too small to fully occupy the GPU. In this paper, we focus on the impact of GEMM on deep learning and propose a framework that computes a batch of GEMMs in one kernel so as to increase GPU occupancy. A suite of tiling strategies is designed for batches of small, variable-size matrices, and the tiling strategy for each GEMM is selected according to its kernel occupancy so that it fits different matrix sizes and GPU architectures. GoogLeNet, implemented with MIOpen, is then used as a representative case, and the batched GEMM framework is integrated into it. The experimental results show that, compared with MAGMA, GoogLeNet optimized with our framework achieves 2.60\(\times\) and 2.79\(\times\) speedups in elapsed time on the AMD Radeon Instinct MI50 and MI100 GPUs, respectively.
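A minimal CUDA-style sketch of this batching idea is given below. It assumes a simple GemmProblem descriptor, a host-computed firstBlock prefix-sum array, and a fixed 16x16 tile; none of these come from the paper, whose implementation targets AMD GPUs through MIOpen/ROCm and chooses its tiling per GEMM from kernel occupancy. The sketch only illustrates how a single kernel launch can cover a whole batch of small, variable-size GEMMs: each thread block is mapped to one output tile of one problem, so the combined grid is large enough to occupy the GPU even though every individual matrix is small.

```cuda
// Illustrative sketch, not the authors' implementation: one kernel launch
// computes a whole batch of independent GEMMs C_i = A_i * B_i.
#include <cuda_runtime.h>

#define TILE 16

struct GemmProblem {          // one small GEMM in the batch (row-major)
    const float *A;           // M x K
    const float *B;           // K x N
    float       *C;           // M x N
    int M, N, K;
};

__global__ void batchedGemmKernel(const GemmProblem *problems,
                                  const int *firstBlock,  // prefix sums of tile counts
                                  int batchCount)
{
    // Map this thread block to a problem: firstBlock[i] holds the index of
    // the first block assigned to problem i (linear scan for brevity).
    int p = 0;
    while (p + 1 < batchCount && (int)blockIdx.x >= firstBlock[p + 1]) ++p;
    GemmProblem prob = problems[p];

    // Which TILE x TILE tile of C this block owns inside its problem.
    int localBlock  = blockIdx.x - firstBlock[p];
    int tilesPerRow = (prob.N + TILE - 1) / TILE;
    int row = (localBlock / tilesPerRow) * TILE + threadIdx.y;
    int col = (localBlock % tilesPerRow) * TILE + threadIdx.x;

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    float acc = 0.0f;

    // Classic shared-memory tiling over K, with bounds checks because the
    // matrices are small and rarely multiples of TILE.
    for (int k0 = 0; k0 < prob.K; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] =
            (row < prob.M && k0 + (int)threadIdx.x < prob.K)
                ? prob.A[row * prob.K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (col < prob.N && k0 + (int)threadIdx.y < prob.K)
                ? prob.B[(k0 + threadIdx.y) * prob.N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < prob.M && col < prob.N)
        prob.C[row * prob.N + col] = acc;
}
```

On the host, one would count ceil(M_i/TILE) * ceil(N_i/TILE) tiles for each problem, prefix-sum those counts into firstBlock, and launch the kernel once with dim3(TILE, TILE) threads per block and firstBlock[batchCount] blocks in total. Selecting the tile shape per GEMM, rather than hard-coding 16x16 as this sketch does, is the occupancy-driven decision the proposed framework makes for different matrix sizes and GPU architectures.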
References
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: International Conference on High Performance Computing, Springer, pp 21–38
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2017) Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. In: Proceedings of the International Conference on Supercomputing, pp 1–10
AMD (2021a) Introducing AMD CDNA architecture: the all-new AMD GPU architecture for the modern era of HPC & AI. https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf
AMD (2021b) Introducing RDNA architecture: the all-new Radeon gaming architecture powering “Navi”. https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
AMD (2021c) rocBLAS Documentation. https://rocblas.readthedocs.io/_/downloads/en/rocm-4.5.2/pdf/
AMD (2021d) ROCm Documentation. https://rocmdocs.amd.com/_/downloads/en/latest/pdf/
Bao W, Chang LW, Chen Y, Deng K, Agarwal A, Barsoum E, Taha A (2019) NGEMM: Optimizing GEMM for deep learning via compiler-based techniques. arXiv preprint arXiv:1910.00178
Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition, Suvisoft
Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759
Intel (2021) Intel oneAPI Programming Guide. https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-programming-guide.pdf
Jia Y (2014) Learning semantic image representations at a large scale. University of California, Berkeley
Khan J, Fultz P, Tamazov A, Lowell D, Liu C, Melesse M, Nandhimandalam M, Nasyrov K, Perminov I, Shah T, Filippov V, Zhang J, Zhou J, Natarajan B, Daga M (2019) MIOpen: An open source library for deep learning primitives. arXiv preprint arXiv:1910.00078
Kim R, Choi J, Lee M (2019) Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp 101–110
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
Kurzak J, Tomov S, Dongarra J (2012) Autotuning GEMM kernels for the Fermi GPU. IEEE Trans Parallel Distrib Syst 23(11):2045–2057
Lai J, Seznec A (2013) Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE, pp 1–10
Li X, Zhang G, Huang HH, Wang Z, Zheng W (2016) Performance analysis of GPU-based convolutional neural networks. In: 2016 45th International Conference on Parallel Processing (ICPP), IEEE, pp 67–76
Li X, Liang Y, Yan S, Jia L, Li Y (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp 229–241
Lym S, Lee D, O’Connor M, Chatterjee N, Erez M (2019) Delta: GPU performance model for deep learning applications with in-depth memory system traffic analysis. In: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, pp 293–303
Nath R, Tomov S, Dongarra J (2010) An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511–515
NVIDIA (2018) CUTLASS: Fast Linear Algebra in CUDA C++. https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/
NVIDIA (2021a) cuBLAS. https://docs.nvidia.com/cuda/cublas/
NVIDIA (2021b) CUDA Occupancy Calculator. https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
van Oostrum R, Chalmers N, et al (2019) AMD GPU Hardware Basics. https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV) 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y
Shi S, Wang Q, Xu P, Chu X (2016) Benchmarking state-of-the-art deep learning software tools. In: 2016 7th International Conference on Cloud Computing and Big Data (CCBD), IEEE, pp 99–104
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9
Tan G, Li L, Triechle S, Phillips E, Bao Y, Sun N (2011) Fast implementation of DGEMM on Fermi GPU. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–11
Vasudevan A, Anderson A, Gregg D (2017) Parallel multi channel convolution using general matrix multiplication. In: 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), IEEE, pp 19–24
Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Yan D, Wang W, Chu X (2020) Optimizing batched Winograd convolution on GPUs. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 32–44
Acknowledgements
This work was supported by the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China (No. 210602103890051), and by the Major Project on the Integration of Industry, Education and Research of Zhongshan (No. 210610173898370).
Cite this article
Yang, Z., Lu, L. & Wang, R. A batched GEMM optimization framework for deep learning. J Supercomput 78, 13393–13408 (2022). https://doi.org/10.1007/s11227-022-04336-3