Improving Performance of Triangular Matrix-Vector BLAS Routines on GPUs

Karwacki, Marek; Stpiczyński, Przemysław

doi:10.3233/978-1-61499-041-3-405

Authors

Marek Karwacki, Przemysław Stpiczyński

Pages

405 - 412

DOI

10.3233/978-1-61499-041-3-405

Series

Advances in Parallel Computing

Ebook

Volume 22: Applications, Tools and Techniques on the Road to Exascale Computing

Abstract

CUBLAS is a widely used implementation of BLAS (Basic Linear Algebra Subprograms) for NVIDIA CUDA Graphics Processing Units (GPUs). The aim of this paper is to show that the performance of selected Level 2 BLAS routines for triangular matrices can be improved with GPU-oriented optimization techniques such as the use of shared memory and coalesced memory access. We present new implementations of the routines xTRMV and xTRSV. Experiments carried out on two GPU architectures, Tesla M2050 and GeForce GTX 260, show that these new implementations are up to 500% faster than the corresponding routines from the CUBLAS library.
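
For readers unfamiliar with the two routines named in the abstract, the following is a minimal reference sketch of what they compute, restricted to the lower-triangular, non-transposed, non-unit-diagonal case. It is plain sequential Python and is not the GPU implementation described in the paper. The function names `trmv_lower` and `trsv_lower` are hypothetical, and the functions return a new list, whereas the BLAS routines overwrite the input vector in place.

```python
def trmv_lower(a, x):
    """Return the product A*x for a lower-triangular matrix A.

    `a` is a list of rows; only entries on or below the diagonal are read,
    which mirrors how xTRMV ignores the other triangle.
    """
    n = len(x)
    return [sum(a[i][j] * x[j] for j in range(i + 1)) for i in range(n)]


def trsv_lower(a, b):
    """Solve A*x = b by forward substitution for a lower-triangular A.

    Assumes a non-zero diagonal (no singularity check, as in xTRSV).
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # x[i] depends on all previously computed x[0..i-1].
        s = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / a[i][i]
    return x


# Small usage example: the solve undoes the multiply.
A = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
y = trmv_lower(A, [1.0, 2.0, 3.0])
x = trsv_lower(A, y)
```

In the multiply, each output entry can be computed independently, while in the solve each entry depends on the ones computed before it. That dependency is what makes the triangular solve the harder of the two to parallelize.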

IOS Press Copyright 2024
