Abstract
In the past few decades, general matrix multiplication (GEMM), the core component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to decompose a problem into many smaller sub-problems, today's BLAS libraries implement batched GEMM routines to achieve high performance in this scenario. MAGMA provides a vbatch routine that computes batched GEMM with variable sizes on the GPU, but unbalanced input leaves some workgroups and threads idle, degrading performance. Unbalanced input also harms load balancing across the GPU's Compute Units, and extreme inputs lead to under-utilization of hardware resources. In this paper, we propose a high-performance batched GEMM computing framework for GPUs. For a large batch of small matrices with variable sizes and an unbalanced distribution, the proposed framework takes both the hardware architecture and the likely data distribution into account, adopting three methods (flexible tile, sort-up, and split-down) to improve hardware utilization and achieve better load balancing. Experimental results show that our framework achieves a 3.02× performance improvement over the latest MAGMA implementation on an AMD Radeon Instinct MI50 GPU, and a 3.14× speedup on the MI100.
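To make the problem setting concrete, the following is a minimal CPU-side sketch of a variable-size batched GEMM together with the "sort-up" idea mentioned above: ordering the batch by per-matrix workload so that similarly sized problems are grouped before dispatch. All names (`gemm`, `sort_batch`, `batched_gemm`) are illustrative only and are not the paper's actual GPU kernels or API.

```python
def gemm(a, b):
    """Plain matrix multiply on nested lists: C = A x B."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def sort_batch(batch):
    """Sort-up sketch: order problems by FLOP count (m * k * n), largest
    first, so that groups of similar size can be launched together and
    workgroups in one launch see comparable amounts of work."""
    return sorted(batch,
                  key=lambda ab: len(ab[0]) * len(ab[1]) * len(ab[1][0]),
                  reverse=True)

def batched_gemm(batch):
    """Sequential reference semantics of a variable-size batched GEMM:
    each (A_i, B_i) pair may have its own dimensions."""
    return [gemm(a, b) for a, b in batch]
```

On a GPU, the sorted batch would then be partitioned into launches of near-uniform tile shape; this sketch only captures the ordering step and the reference semantics, not the tiling or kernel dispatch.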
Acknowledgements
This research, conducted at the South China University of Technology, was supported by AMD and in part by the Guangzhou Produce & Research Fund under Grant No. 201902020004. The Radeon Technologies Group (RTG) provided research facilities for this study.
Cite this article
Wang, R., Yang, Z., Xu, H. et al. A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution. J Supercomput 78, 1741–1758 (2022). https://doi.org/10.1007/s11227-021-03936-9