DOI: 10.1145/3293883.3295734

A coordinated tiling and batching framework for efficient GEMM on GPUs

Published: 16 February 2019

ABSTRACT

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrices into many tiles and exploit the parallelism within and between tiles; this tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, however, a GPU can fully unleash its computing power only when the matrices are large enough to yield a sufficient number of tiles and enough work per tile. In many real-world applications, especially in the deep learning domain, the matrices are small. To address this, prior work proposes batched GEMM, which processes a group of small, independent GEMMs together in a single CUDA kernel.
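To make the batched-GEMM idea concrete, the following is a minimal CUDA sketch, not taken from the paper: each thread block computes one small GEMM from the batch. The kernel name, the fixed 16x16 problem size, and the packed row-major layout are all illustrative assumptions.

```cuda
// Minimal batched SGEMM sketch (not the paper's kernel): one thread block
// computes one small C = A * B from the batch. The 16x16 uniform problem
// size and packed row-major storage are assumptions for illustration only.
#include <cuda_runtime.h>

#define N 16  // assumed small, uniform matrix dimension for every GEMM

__global__ void batched_sgemm_naive(const float* A, const float* B, float* C,
                                    int batch) {
    int b = blockIdx.x;                 // which GEMM in the batch
    if (b >= batch) return;
    const float* a  = A + b * N * N;    // this GEMM's operands and result
    const float* bm = B + b * N * N;
    float*       c  = C + b * N * N;

    int row = threadIdx.y;              // one thread per output element
    int col = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += a[row * N + k] * bm[k * N + col];
    c[row * N + col] = acc;
}

// Launch sketch: one block per GEMM, N x N threads per block.
// batched_sgemm_naive<<<batch, dim3(N, N)>>>(dA, dB, dC, batch);
```

With one block per GEMM, the grid supplies parallelism across the batch even though each individual problem is too small to keep the GPU busy on its own, which is the gap batched GEMM is meant to close.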

However, the current support for batched GEMM is still rudimentary, because tiling and batching are tightly coupled. A large tile size increases data reuse but decreases thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size increases thread-level parallelism and therefore enlarges the optimization space for batching, but at the cost of data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMMs on GPUs. It is a two-phase framework consisting of a tiling engine and a batching engine: the tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface for the coordinated tiling and batching solution. Experimental results on synthetic batched GEMM cases show that our framework achieves about 1.40X speedup on average over the state-of-the-art technique. On GoogleNet, used as a real-world case study, our framework achieves a 1.23X speedup.
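The sketch below illustrates, under assumptions of our own, how the two phases could fit together: a host-side "tiling engine" enumerates independent tiles across all GEMMs in the batch, and a "batching engine" maps one tile descriptor to each thread block. The TileTask struct, the make_tasks helper, the uniform square problem size, and the fixed tile size are hypothetical; this is not the paper's programming interface or kernel.

```cuda
// Illustrative sketch of the tiling/batching split described above; the
// descriptor layout, helper, and kernel are assumptions, not the paper's API.
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed square tile size

struct TileTask {       // produced by the host-side "tiling engine"
    int gemm_id;        // which GEMM in the batch this tile belongs to
    int tile_row;       // tile coordinates within that GEMM's C matrix
    int tile_col;
};

// Host-side "tiling engine" sketch: enumerate every tile of every GEMM.
// Assumes each GEMM is n x n with n a multiple of TILE.
std::vector<TileTask> make_tasks(int batch, int n) {
    std::vector<TileTask> tasks;
    for (int g = 0; g < batch; ++g)
        for (int i = 0; i < n / TILE; ++i)
            for (int j = 0; j < n / TILE; ++j)
                tasks.push_back({g, i, j});
    return tasks;
}

// "Batching engine" sketch: each thread block picks up one TileTask and
// computes the corresponding TILE x TILE tile of C for its GEMM.
__global__ void batched_tile_kernel(const float* A, const float* B, float* C,
                                    const TileTask* tasks, int n) {
    TileTask t = tasks[blockIdx.x];
    const float* a = A + t.gemm_id * n * n;   // packed row-major operands
    const float* b = B + t.gemm_id * n * n;
    float*       c = C + t.gemm_id * n * n;

    int row = t.tile_row * TILE + threadIdx.y;
    int col = t.tile_col * TILE + threadIdx.x;

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    float acc = 0.0f;
    for (int k0 = 0; k0 < n; k0 += TILE) {    // standard shared-memory tiling
        As[threadIdx.y][threadIdx.x] = a[row * n + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = b[(k0 + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    c[row * n + col] = acc;
}

// Launch sketch: one block per tile task, TILE x TILE threads per block.
// batched_tile_kernel<<<tasks.size(), dim3(TILE, TILE)>>>(dA, dB, dC, dTasks, n);
```

In this sketch the assignment is simply one task per block; the point is only to show that once GEMMs are broken into independent tiles, how those tiles are distributed across thread blocks becomes a separate decision, which is the coordination the framework exploits.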


Published in

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019, 472 pages
ISBN: 9781450362252
DOI: 10.1145/3293883

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 February 2019


        Qualifiers

        • research-article

        Acceptance Rates

PPoPP '19 Paper Acceptance Rate: 29 of 152 submissions, 19%. Overall Acceptance Rate: 230 of 1,014 submissions, 23%.
