Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

The Journal of Supercomputing

Abstract

Programmers usually implement iterative methods that solve partial differential equations by expressing them as a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The overall runtime of the resulting combination is often penalized by the smallest and least efficient vector operations. To improve GPU exploitation, we identify and analyze the kernels that are candidates for fusion according to their data dependences, data type and size, and the available GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of the Basic Linear Algebra Subprograms (BLAS)] on GPU performance. The experimental evaluation shows that this optimization provides noticeable improvements, especially for kernels with lower memory requirements and on more modern GPUs. Fused BLAS operations can therefore help programmers code iterative methods that solve large linear systems of equations efficiently on the GPU. Iterative methods such as the biconjugate gradient method (BCG) are examples that benefit from this optimization strategy. Indeed, kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between \(1.09\times\) and \(1.27\times\) faster on three GPUs with different characteristics.
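
To make the idea concrete, the sketch below (ours, not the paper's implementation; vector names, length and initial values are illustrative assumptions) fuses two of the BLAS-1 AXPY updates that occur in each BCG iteration, x += alpha*p and r -= alpha*q, into one CUDA kernel. Unfused code would issue two separate library calls (e.g., two cublasDaxpy invocations), paying two kernel launches and two passes over global memory; the fused version updates both vectors in a single pass.

// Minimal sketch of BLAS-1 kernel fusion; not the authors' code.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Fuses x += alpha*p and r -= alpha*q into one kernel so both updates
// are performed in a single pass over global memory with one launch.
__global__ void fused_axpy2(int n, double alpha,
                            const double *p, const double *q,
                            double *x, double *r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += alpha * p[i];  // first BLAS-1 update (AXPY)
        r[i] -= alpha * q[i];  // second update, fused into the same pass
    }
}

int main()
{
    const int n = 1 << 20;     // vector length (illustrative)
    const double alpha = 0.5;  // BCG step size (illustrative)
    std::vector<double> ones(n, 1.0);

    double *p, *q, *x, *r;
    cudaMalloc(&p, n * sizeof(double));
    cudaMalloc(&q, n * sizeof(double));
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&r, n * sizeof(double));
    cudaMemcpy(p, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(q, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(x, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(r, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // One launch replaces two separate AXPY kernel launches.
    const int block = 256;
    fused_axpy2<<<(n + block - 1) / block, block>>>(n, alpha, p, q, x, r);
    cudaDeviceSynchronize();

    double x0, r0;
    cudaMemcpy(&x0, x, sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(&r0, r, sizeof(double), cudaMemcpyDeviceToHost);
    printf("x[0] = %.1f, r[0] = %.1f\n", x0, r0);  // expect 1.5 and 0.5

    cudaFree(p); cudaFree(q); cudaFree(x); cudaFree(r);
    return 0;
}

The two updates carry no data dependence between them, which is exactly the fusion criterion the paper analyzes: fusing them halves the kernel-launch overhead and lets each element be touched in a single traversal of global memory.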




Author information

Corresponding author

Correspondence to G. Ortega.

Additional information

The work was supported by the Spanish Ministry of Science and Innovation through projects TIN2010-16144, TIN2008-01117 and TIN2012-37483-C03-03/01, and by the Junta de Andalucía through project P10-TIC-6002 and the CAPAP-H4 network (TIN2011-15734-E). G. Ortega is a fellow of the Spanish FPU program. We thank NVIDIA for hardware donations under the Professor Partnership 2008–2010, CUDA Teaching Center 2012–2013 and CUDA Research Center 2013 awards.

About this article

Cite this article

Tabik, S., Ortega, G. & Garzón, E.M. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. J Supercomput 70, 577–587 (2014). https://doi.org/10.1007/s11227-014-1102-4
