Abstract
General matrix multiplication (GEMM) is one of the most widely used algorithms in fields such as deep learning, astrophysics, signal processing, and advanced physical analysis. It plays a particularly important role in deep learning, especially for convolutional neural networks, because many of the computations involved are converted into matrix multiplications so that the parallel processing power of GPUs can be exploited. However, the matrices produced by this conversion are generally too small to fully occupy the GPU. In this paper, we focus on the impact of GEMM on deep learning and propose a framework that computes a batch of GEMMs in one kernel so as to increase GPU occupancy. A suite of tiling strategies is designed for batches of small, variable-size matrices, and the tiling strategy for each GEMM is selected according to its kernel occupancy so that it fits different matrix sizes and GPU architectures. GoogLeNet, implemented with MIOpen, is then used as a representative case, and the batched GEMM framework is integrated into it. The experimental results show that, compared with MAGMA, GoogLeNet optimized with our framework achieves 2.60\(\times\) and 2.79\(\times\) speedups in elapsed time on the AMD Radeon Instinct MI50 and MI100 GPUs, respectively.
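A minimal CUDA-style sketch of this batching idea is given below. It assumes a simple GemmProblem descriptor, a host-computed firstBlock prefix-sum array, and a fixed 16x16 tile; none of these come from the paper, whose implementation targets AMD GPUs through MIOpen/ROCm and chooses its tiling per GEMM from kernel occupancy. The sketch only illustrates how a single kernel launch can cover a whole batch of small, variable-size GEMMs: each thread block is mapped to one output tile of one problem, so the combined grid is large enough to occupy the GPU even though every individual matrix is small.

```cuda
// Illustrative sketch, not the authors' implementation: one kernel launch
// computes a whole batch of independent GEMMs C_i = A_i * B_i.
#include <cuda_runtime.h>

#define TILE 16

struct GemmProblem {          // one small GEMM in the batch (row-major)
    const float *A;           // M x K
    const float *B;           // K x N
    float       *C;           // M x N
    int M, N, K;
};

__global__ void batchedGemmKernel(const GemmProblem *problems,
                                  const int *firstBlock,  // prefix sums of tile counts
                                  int batchCount)
{
    // Map this thread block to a problem: firstBlock[i] holds the index of
    // the first block assigned to problem i (linear scan for brevity).
    int p = 0;
    while (p + 1 < batchCount && (int)blockIdx.x >= firstBlock[p + 1]) ++p;
    GemmProblem prob = problems[p];

    // Which TILE x TILE tile of C this block owns inside its problem.
    int localBlock  = blockIdx.x - firstBlock[p];
    int tilesPerRow = (prob.N + TILE - 1) / TILE;
    int row = (localBlock / tilesPerRow) * TILE + threadIdx.y;
    int col = (localBlock % tilesPerRow) * TILE + threadIdx.x;

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    float acc = 0.0f;

    // Classic shared-memory tiling over K, with bounds checks because the
    // matrices are small and rarely multiples of TILE.
    for (int k0 = 0; k0 < prob.K; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] =
            (row < prob.M && k0 + (int)threadIdx.x < prob.K)
                ? prob.A[row * prob.K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (col < prob.N && k0 + (int)threadIdx.y < prob.K)
                ? prob.B[(k0 + threadIdx.y) * prob.N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < prob.M && col < prob.N)
        prob.C[row * prob.N + col] = acc;
}
```

On the host, one would count ceil(M_i/TILE) * ceil(N_i/TILE) tiles for each problem, prefix-sum those counts into firstBlock, and launch the kernel once with dim3(TILE, TILE) threads per block and firstBlock[batchCount] blocks in total. Selecting the tile shape per GEMM, rather than hard-coding 16x16 as this sketch does, is the occupancy-driven decision the proposed framework makes for different matrix sizes and GPU architectures.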
References
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: International Conference on High Performance Computing, Springer, pp 21–38
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2017) Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. In: Proceedings of the International Conference on Supercomputing, pp 1–10
AMD (2021a) Introducing AMD CDNA architecture: the all-new AMD GPU architecture for the modern era of HPC & AI. https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf
AMD (2021b) Introducing RDNA architecture: the all-new Radeon gaming architecture powering “Navi”. https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
AMD (2021c) rocBLAS Documentation. https://rocblas.readthedocs.io/_/downloads/en/rocm-4.5.2/pdf/
AMD (2021d) ROCm Documentation. https://rocmdocs.amd.com/_/downloads/en/latest/pdf/
Bao W, Chang LW, Chen Y, Deng K, Agarwal A, Barsoum E, Taha A (2019) NGEMM: Optimizing GEMM for deep learning via compiler-based techniques. arXiv preprint arXiv:1910.00178
Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition, Suvisoft
Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759
Intel (2021) Intel oneAPI Programming Guide. https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-programming-guide.pdf
Jia Y (2014) Learning semantic image representations at a large scale. University of California, Berkeley
Khan J, Fultz P, Tamazov A, Lowell D, Liu C, Melesse M, Nandhimandalam M, Nasyrov K, Perminov I, Shah T, Filippov V, Zhang J, Zhou J, Natarajan B, Daga M (2019) MIOpen: An open source library for deep learning primitives. arXiv preprint arXiv:1910.00078
Kim R, Choi J, Lee M (2019) Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp 101–110
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
Kurzak J, Tomov S, Dongarra J (2012) Autotuning GEMM kernels for the Fermi GPU. IEEE Trans Parallel Distrib Syst 23(11):2045–2057
Lai J, Seznec A (2013) Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE, pp 1–10
Li X, Zhang G, Huang HH, Wang Z, Zheng W (2016) Performance analysis of GPU-based convolutional neural networks. In: 2016 45th International Conference on Parallel Processing (ICPP), IEEE, pp 67–76
Li X, Liang Y, Yan S, Jia L, Li Y (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp 229–241
Lym S, Lee D, O’Connor M, Chatterjee N, Erez M (2019) Delta: GPU performance model for deep learning applications with in-depth memory system traffic analysis. In: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, pp 293–303
Nath R, Tomov S, Dongarra J (2010) An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511–515
NVIDIA (2018) CUTLASS: Fast Linear Algebra in CUDA C++. https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/
NVIDIA (2021a) cuBLAS. https://docs.nvidia.com/cuda/cublas/
NVIDIA (2021b) CUDA Occupancy Calculator. https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
van Oostrum R, Chalmers N, et al (2019) AMD GPU Hardware Basics. https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV) 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y
Shi S, Wang Q, Xu P, Chu X (2016) Benchmarking state-of-the-art deep learning software tools. In: 2016 7th International Conference on Cloud Computing and Big Data (CCBD), IEEE, pp 99–104
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9
Tan G, Li L, Triechle S, Phillips E, Bao Y, Sun N (2011) Fast implementation of DGEMM on Fermi GPU. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–11
Vasudevan A, Anderson A, Gregg D (2017) Parallel multi channel convolution using general matrix multiplication. In: 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), IEEE, pp 19–24
Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1492–1500
Yan D, Wang W, Chu X (2020) Optimizing batched Winograd convolution on GPUs. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 32–44
Acknowledgements
This work was supported by the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China (No. 210602103890051), and by the Major Project on the Integration of Industry, Education and Research of Zhongshan (No. 210610173898370).
Cite this article
Yang, Z., Lu, L. & Wang, R. A batched GEMM optimization framework for deep learning. J Supercomput 78, 13393–13408 (2022). https://doi.org/10.1007/s11227-022-04336-3