
A coordinated tiling and batching framework for efficient GEMM on GPUs

Published: 16 February 2019


General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix is large and there are a sufficient number of tiles and sufficient workload for each tile. However, in many real-world applications, especially in the deep learning domain, the matrices are small. To this end, prior work proposes batched GEMM, which processes a group of small independent GEMMs together by designing a single CUDA kernel for all of them.
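As a rough illustration of why small matrices leave a GPU underused, the sketch below counts output tiles for a few cases. The matrix shapes, tile sizes, and the helper name `num_tiles` are assumptions chosen for illustration; none of them come from the paper.

```python
from math import ceil


def num_tiles(m, n, tile_m, tile_n):
    """Count the output tiles when an m x n result is cut into tile_m x tile_n tiles."""
    return ceil(m / tile_m) * ceil(n / tile_n)


# One large GEMM yields many tiles, so there is plenty of independent work.
large = num_tiles(4096, 4096, 128, 128)

# One small GEMM with the same tile size yields a single (partly empty) tile.
small = num_tiles(64, 64, 128, 128)

# Batching 32 small GEMMs and using smaller tiles restores tile-level parallelism.
batched = sum(num_tiles(64, 64, 16, 16) for _ in range(32))

print(large, small, batched)
```

Under these assumed sizes, the single small GEMM exposes far less tile-level parallelism than the large one, which is the gap that batching many small GEMMs into one kernel is meant to close.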
However, the current support for batched GEMM is still rudimentary. Tiling and batching are tightly correlated. A large tile size can increase data reuse, but it decreases thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size can increase thread-level parallelism and thus provide a larger optimization space for batching, but at the cost of data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMMs on GPUs. It is a two-phase framework consisting of a tiling engine and a batching engine that together perform efficient batched GEMM on GPUs. The tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface for the coordinated tiling and batching solution. Finally, experimental results on synthetic batched GEMM cases show that our framework achieves about 1.40X speedup on average over the state-of-the-art technique. We also use GoogleNet as a real-world case study, on which our framework achieves 1.23X speedup.
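The two-phase idea can be sketched on the host side in plain Python. This is a minimal sketch under stated assumptions, not the framework's actual algorithm: the abstract does not describe how tile sizes are chosen or how tiles are mapped to thread blocks, so the fixed tile size, the greedy least-loaded assignment, and the names `Tile`, `tile_gemms`, and `batch_tiles` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    gemm_id: int  # index of the GEMM in the batch that owns this tile
    row: int      # first output row covered by the tile
    col: int      # first output column covered by the tile
    rows: int     # tile height, clipped at the matrix edge
    cols: int     # tile width, clipped at the matrix edge
    k: int        # shared dimension of the owning GEMM

    @property
    def work(self):
        # Multiply-accumulate count, used as a rough cost estimate.
        return self.rows * self.cols * self.k


def tile_gemms(shapes, tile_m, tile_n):
    """Tiling phase (assumed): cut every (m, n, k) GEMM into independent output tiles."""
    tiles = []
    for gemm_id, (m, n, k) in enumerate(shapes):
        for row in range(0, m, tile_m):
            for col in range(0, n, tile_n):
                tiles.append(Tile(gemm_id, row, col,
                                  min(tile_m, m - row), min(tile_n, n - col), k))
    return tiles


def batch_tiles(tiles, num_blocks):
    """Batching phase (assumed): give each tile to the currently least-loaded block."""
    blocks = [[] for _ in range(num_blocks)]
    loads = [0] * num_blocks
    for tile in sorted(tiles, key=lambda t: t.work, reverse=True):
        target = loads.index(min(loads))  # first block with the smallest load
        blocks[target].append(tile)
        loads[target] += tile.work
    return blocks, loads


# Two small GEMMs of different shapes, given as (m, n, k).
shapes = [(32, 32, 64), (16, 48, 32)]
tiles = tile_gemms(shapes, tile_m=16, tile_n=16)
blocks, loads = batch_tiles(tiles, num_blocks=4)
print(len(tiles), loads)
```

Shrinking `tile_m` and `tile_n` produces more tiles for the batching phase to balance but lowers the data reuse within each tile, which is the trade-off the abstract describes. A real implementation would perform the per-tile computation inside a single CUDA kernel rather than in Python lists.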







Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages



Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines


Author Tags

  1. GEMM
  2. GPGPU
  3. batching
  4. tiling


  • Research-article


Acceptance Rates

PPoPP '19 Paper Acceptance Rate 29 of 152 submissions, 19%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%


Article Metrics

  • Downloads (Last 12 months): 211
  • Downloads (Last 6 weeks): 17
Reflects downloads up to 19 Feb 2025


Cited By

  • (2025) A load-balanced acceleration method for small and irregular batch matrix multiplication on GPU. Journal of Systems Architecture 160 (103341). DOI: 10.1016/j.sysarc.2025.103341. Online publication date: Mar-2025
  • (2025) Optimizing 2D convolution for DCUs. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-024-00205-y. Online publication date: 22-Feb-2025
  • (2024) HSS: enhancing IoT malicious traffic classification leveraging hybrid sampling strategy. Cybersecurity 7:1. DOI: 10.1186/s42400-023-00201-9. Online publication date: 1-Jun-2024
  • (2024) Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs. Proceedings of the 38th ACM International Conference on Supercomputing (137-149). DOI: 10.1145/3650200.3656620. Online publication date: 30-May-2024
  • (2024) FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks. Proceedings of the 38th ACM International Conference on Supercomputing (511-524). DOI: 10.1145/3650200.3656593. Online publication date: 30-May-2024
  • (2024) VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations. IEEE Transactions on Computers 73:10 (2378-2390). DOI: 10.1109/TC.2023.3285095. Online publication date: Oct-2024
  • (2024) Low-bit CUTLASS GEMM Template Auto-tuning using Neural Network. 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) (394-401). DOI: 10.1109/ISPA63168.2024.00057. Online publication date: 30-Oct-2024
  • (2024) High-Utilization GPGPU Design for Accelerating GEMM Workloads: An Incremental Approach. 2024 IEEE International Symposium on Circuits and Systems (ISCAS) (1-5). DOI: 10.1109/ISCAS58744.2024.10558334. Online publication date: 19-May-2024
  • (2024) STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access 12 (70581-70599). DOI: 10.1109/ACCESS.2024.3402326. Online publication date: 2024
  • (2024) Optimizing depthwise separable convolution on DCU. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-024-00200-3. Online publication date: 13-Dec-2024
