Skip to main content

Performance, Design, and Autotuning of Batched GEMM for GPUs

  • Conference paper
  • First Online:
High Performance Computing (ISC High Performance 2016)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9697))

Included in the following conference series:

Abstract

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general.

This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Abdelfattah, A., Baboulin, M., Dobrev, V., Dongarra, J., Earl, C., Falcou, J., Haidar, A., Karlin, I., Kolev, T., Masliah, I., Tomov, S.: High-performance tensor contractions for GPUs. In: International Conference on Computational Science (ICCS 2016). Elsevier, Procedia Computer Science, San Diego, CA, USA, June 2016

    Google Scholar 

  2. Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys.: Conf. Ser. 180(1), 012037 (2009)

    Google Scholar 

  3. Anderson, M., Sheffield, D., Keutzer, K.: A predictive model for solving small linear algebra problems in GPU registers. In: IEEE 26th International Parallel Distributed Processing Symposium (IPDPS) (2012)

    Google Scholar 

  4. Dong, T., Haidar, A., Luszczek, P., Harris, A., Tomov, S., Dongarra, J.: LU Factorization of small matrices: accelerating batched DGETRF on the GPU. In: Proceedings of 16th IEEE International Conference on High Performance and Communications (HPCC 2014) August 2014

    Google Scholar 

  5. Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: A step towards energy efficient computing: redesigning a hydrodynamic application on CPU-GPU. In: IEEE 28th International Parallel Distributed Processing Symposium (IPDPS) (2014)

    Google Scholar 

  6. Gray, S.: A full walk through of the SGEMM implementation (2015). https://github.com/NervanaSystems/maxas/wiki/SGEMM

  7. Haidar, A., Dong, T., Luszczek, P., Tomov, S., Dongarra, J.: Batched matrix computations on hardware accelerators based on GPUs. Int. J. High Perform. Comput. Appl. (2015). http://hpc.sagepub.com/content/early/2015/02/06/1094342014567546.abstract

  8. Haidar, A., Dong, T.T., Tomov, S., Luszczek, P., Dongarra, J.: A framework for batched and GPU-resident factorization algorithms applied to block householder transformations. In: Kunkel, J.M., Ludwig, T. (eds.) ISC High Performance 2015. LNCS, vol. 9137, pp. 31–47. Springer, Heidelberg (2015)

    Chapter  Google Scholar 

  9. Im, E.J., Yelick, K., Vuduc, R.: Sparsity: optimization framework for sparse matrix kernels. Int. J High Perform Comput. Appl. 18(1), 135–158 (2004). http://dx.doi.org/10.1177/1094342004041296

    Article  Google Scholar 

  10. Jhurani, C., Mullowney, P.: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. CoRR abs/1304.7053 (2013). http://arxiv.org/abs/1304.7053

  11. Khodayari, A., Zomorrodi, A.R., Liao, J.C., Maranas, C.: A kinetic model of escherichia coli core metabolism satisfying multiple sets of mutant flux data. Metab. Eng. 25C, 50–62 (2014)

    Article  Google Scholar 

  12. Kurzak, J., Tomov, S., Dongarra, J.: Autotuning GEMM kernels for the Fermi GPU. IEEE Trans. Parallel Distrib. Syst. 23(11), 2045–2057 (2012)

    Article  Google Scholar 

  13. Lai, J., Seznec, A.: Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), CGO 2013, pp. 1–10. IEEE Computer Society, Washington, DC, USA (2013). http://dx.doi.org/10.1109/CGO.2013.6494986

  14. Li, Y., Dongarra, J., Tomov, S.: A note on auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part I. LNCS, vol. 5544, pp. 884–892. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  15. Lopez, M., Horton, M.: Batch matrix exponentiation. In: Kindratenko, V. (ed.) Numerical Computations with GPUs, pp. 45–67. Springer International Publishing (2014), http://dx.doi.org/10.1007/978-3-319-06548-9_3

  16. Messer, O.E.B., Harris, J.A., Parete-Koon, S., Chertkow, M.A.: Multicore and accelerator development for a leadership-class stellar astrophysics code. In: Manninen, P., Öster, P. (eds.) PARA 2012. LNCS, vol. 7782, pp. 92–106. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  17. Molero, J., Garzón, E., García, I., Quintana-Ortí, E., Plaza, A.: Poster: a batched Cholesky solver for local RX anomaly detection on GPUs, PUMPS (2013)

    Google Scholar 

  18. Nath, R., Tomov, S., Dongarra, J.: An improved magma GEMM for fermi graphics processing units. Int. J. High Perform. Comput. Appl. 24(4), 511–515 (2010). http://dx.doi.org/10.1177/1094342010385729

    Article  Google Scholar 

  19. Tan, G., Li, L., Triechle, S., Phillips, E., Bao, Y., Sun, N.: Fast implementation of DGEMM on Fermi GPU. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 35:1–35:11. ACM, New York (2011). http://doi.acm.org/10.1145/2063384.2063431

  20. Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parellel Comput. Syst. Appl. 36(5–6), 232–240 (2010)

    Article  MATH  Google Scholar 

  21. Volkov, V., Demmel, J.: Benchmarking GPUs to tune dense linear algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pp. 1–11. IEEE Press, Piscataway (2008)

    Google Scholar 

  22. Yeralan, S.N., Davis, T.A., Ranka, S.: Sparse mulitfrontal QR on the GPU. Technical report, University of Florida Technical Report (2013). http://faculty.cse.tamu.edu/davis/publications_files/qrgpu_paper.pdf

Download references

Acknowledgment

This work is based upon work supported by the National Science Foundation under Grants No. ACI-1339822 and CSR 1514286, NVIDIA, the Department of Energy (LLNL subcontract under DOE contract DE-AC52-07NA27344), and in part by the Russian Scientific Foundation, Agreement N14-11-00190.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ahmad Abdelfattah .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. (2016). Performance, Design, and Autotuning of Batched GEMM for GPUs. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science(), vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-41321-1_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41320-4

  • Online ISBN: 978-3-319-41321-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics