Abstract
The main purpose of this paper is to present a very efficient GPU implementation of the trmv, the product of a triangular matrix and a vector. Developers usually compute the trmv with cuBLAS, a linear algebra library optimized for each generation of NVIDIA GPUs. To attain better performance than cuBLAS, our GPU implementation of the trmv uses various acceleration techniques for the latest GPUs. More specifically, our GPU implementation has the following features: (1) only one kernel is called; (2) the maximum number of threads is invoked; (3) all accesses to the global memory are coalesced; (4) all accesses to the shared memory are free of bank conflicts; and (5) shared memory access is minimized by using a warp shuffle function. Experimental results on five generations of NVIDIA GPUs, for fp32 matrices of sizes from \(32\times 32\) to \(\mathrm {16K}\times \mathrm {16K}\), show that our GPU implementation is faster than cuBLAS and muBLAS for almost all matrix sizes and GPU generations.
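To make the operation concrete, the following is a minimal CPU reference sketch of what the trmv computes. It is not the GPU kernel described in the abstract. It assumes an upper triangular matrix stored row-major in a full \(n\times n\) array; the function name `trmv_upper` and this storage layout are illustrative assumptions and do not come from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (CPU) trmv sketch: y = A * x, where A is treated as upper triangular.
// Assumption: A is stored row-major in a full n*n array; entries below the
// diagonal are never read, so they may hold arbitrary values.
std::vector<float> trmv_upper(const std::vector<float>& a,
                              const std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(a.size() == n * n);  // the matrix must match the vector length

    std::vector<float> y(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        // Row i of an upper triangular matrix has nonzeros only in columns j >= i.
        for (std::size_t j = i; j < n; ++j) {
            sum += a[i * n + j] * x[j];
        }
        y[i] = sum;
    }
    return y;
}
```

Row \(i\) contributes only \(n-i\) multiply-adds, so the trmv performs roughly half the arithmetic of a general matrix-vector product of the same size. The rows have unequal lengths, which is one reason a direct one-thread-per-row mapping tends to leave GPU threads unevenly loaded.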
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Inoue, T., Tokura, H., Nakano, K., Ito, Y. (2020). Efficient Triangular Matrix Vector Multiplication on the GPU. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_42
DOI: https://doi.org/10.1007/978-3-030-43229-4_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43228-7
Online ISBN: 978-3-030-43229-4
eBook Packages: Computer Science (R0)