Efficient Triangular Matrix Vector Multiplication on the GPU

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12043)

Abstract

The main purpose of this paper is to present a very efficient GPU implementation of the trmv, the product of a triangular matrix and a vector. Developers usually compute the trmv with cuBLAS, a linear algebra library optimized for each generation of GPUs. To attain better performance than cuBLAS, our GPU implementation of the trmv uses several acceleration techniques for the latest GPUs. More specifically, it has the following features: (1) only one kernel is called; (2) the maximum number of threads is invoked; (3) all accesses to the global memory are coalesced; (4) all accesses to the shared memory are free of bank conflicts; and (5) shared memory access is minimized by using a warp shuffle function. Experimental results on five generations of NVIDIA GPUs, for fp32 matrices of sizes from \(32\times 32\) to \(\mathrm {16K}\times \mathrm {16K}\), show that our GPU implementation is faster than cuBLAS and muBLAS for almost all matrix sizes and GPU generations.
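
To make the abstract's terms concrete, the following is a minimal CPU sketch in C++, not the authors' GPU kernel. It assumes an upper triangular matrix stored row-major in a dense n-by-n array, which the abstract does not specify. The names `trmv_upper` and `warp_reduce_emulated` are hypothetical. The first function shows what the trmv computes; the second emulates on the CPU the tree reduction that a 32-lane warp can perform with a shuffle-down instruction, which is one common way a warp shuffle can replace shared memory traffic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference trmv: y = U * x for an n-by-n upper triangular matrix U.
// Assumption: U is stored row-major in a dense n*n array, and entries
// below the diagonal are ignored rather than required to be zero.
std::vector<float> trmv_upper(const std::vector<float>& U,
                              const std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(U.size() == n * n);
    std::vector<float> y(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        // Only columns j >= i contribute to row i of an upper triangular matrix.
        for (std::size_t j = i; j < n; ++j) {
            sum += U[i * n + j] * x[j];
        }
        y[i] = sum;
    }
    return y;
}

// CPU emulation of a shuffle-down tree reduction across one 32-lane warp.
// Each step, every lane adds the value held by the lane `offset` positions
// above it; after 5 steps lane 0 holds the sum of all 32 inputs.
float warp_reduce_emulated(std::vector<float> lanes) {
    const std::size_t kWarpSize = 32;
    assert(lanes.size() == kWarpSize);
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        // Snapshot the registers: all lanes read before any lane writes.
        const std::vector<float> prev = lanes;
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            const std::size_t src = lane + offset;
            // An out-of-range source lane yields the caller's own value.
            lanes[lane] = prev[lane] + (src < kWarpSize ? prev[src] : prev[lane]);
        }
    }
    return lanes[0];  // Only lane 0 is guaranteed to hold the full sum.
}
```

On a GPU, a reduction of this shape would combine the partial products held by the threads of a warp without passing them through shared memory; how the paper's kernel actually assigns matrix elements to threads is not stated in the abstract.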

Author information

Corresponding author

Correspondence to Koji Nakano.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Inoue, T., Tokura, H., Nakano, K., Ito, Y. (2020). Efficient Triangular Matrix Vector Multiplication on the GPU. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_42

  • DOI: https://doi.org/10.1007/978-3-030-43229-4_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43228-7

  • Online ISBN: 978-3-030-43229-4

  • eBook Packages: Computer Science, Computer Science (R0)
