Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study

The Journal of Supercomputing

Abstract

Programmers usually implement iterative methods that solve partial differential equations by expressing them as a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The overall runtime of the resulting combination is often penalized by the smallest and least efficient vector operations. To improve GPU exploitation, we identify and analyze the kernels that are candidates for fusion according to their data dependences, data type and size, and the available GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of the Basic Linear Algebra Subprograms (BLAS)] on GPU performance. The experimental evaluation shows that this optimization provides noticeable improvements, especially for kernels with lower memory requirements and on more modern GPUs. Fused BLAS operations can therefore help programmers code iterative methods that solve large linear systems of equations efficiently on the GPU. Iterative methods such as the biconjugate gradient method (BCG) are examples that benefit from this optimization strategy. Indeed, kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between \(1.09\times\) and \(1.27\times\) faster on three GPUs with different characteristics.
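
To make the idea concrete, the sketch below (ours, not the paper's implementation; vector names, length and initial values are illustrative assumptions) fuses two of the BLAS-1 AXPY updates that occur in each BCG iteration, x += alpha*p and r -= alpha*q, into one CUDA kernel. Unfused code would issue two separate library calls (e.g., two cublasDaxpy invocations), paying two kernel launches and two passes over global memory; the fused version updates both vectors in a single pass.

// Minimal sketch of BLAS-1 kernel fusion; not the authors' code.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Fuses x += alpha*p and r -= alpha*q into one kernel so both updates
// are performed in a single pass over global memory with one launch.
__global__ void fused_axpy2(int n, double alpha,
                            const double *p, const double *q,
                            double *x, double *r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += alpha * p[i];  // first BLAS-1 update (AXPY)
        r[i] -= alpha * q[i];  // second update, fused into the same pass
    }
}

int main()
{
    const int n = 1 << 20;     // vector length (illustrative)
    const double alpha = 0.5;  // BCG step size (illustrative)
    std::vector<double> ones(n, 1.0);

    double *p, *q, *x, *r;
    cudaMalloc(&p, n * sizeof(double));
    cudaMalloc(&q, n * sizeof(double));
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&r, n * sizeof(double));
    cudaMemcpy(p, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(q, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(x, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(r, ones.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // One launch replaces two separate AXPY kernel launches.
    const int block = 256;
    fused_axpy2<<<(n + block - 1) / block, block>>>(n, alpha, p, q, x, r);
    cudaDeviceSynchronize();

    double x0, r0;
    cudaMemcpy(&x0, x, sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(&r0, r, sizeof(double), cudaMemcpyDeviceToHost);
    printf("x[0] = %.1f, r[0] = %.1f\n", x0, r0);  // expect 1.5 and 0.5

    cudaFree(p); cudaFree(q); cudaFree(x); cudaFree(r);
    return 0;
}

The two updates carry no data dependence between them, which is exactly the fusion criterion the paper analyzes: fusing them halves the kernel-launch overhead and lets each element be touched in a single traversal of global memory.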




Author information

Corresponding author

Correspondence to G. Ortega.

Additional information

The work was supported by the Spanish Ministry of Science and Innovation through projects TIN2010-16144, TIN2008-01117 and TIN2012-37483-C03-03/01, and by the Junta de Andalucía through project P10-TIC-6002 and the CAPAP-H4 network (TIN2011-15734-E). G. Ortega is a fellow of the Spanish FPU program. We thank NVIDIA for hardware donations under the Professor Partnership 2008–2010, CUDA Teaching Center 2012–2013 and CUDA Research Center 2013 awards.

About this article

Cite this article

Tabik, S., Ortega, G. & Garzón, E.M. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. J Supercomput 70, 577–587 (2014). https://doi.org/10.1007/s11227-014-1102-4
