
SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU

The Journal of Supercomputing

Abstract

The abundant data parallelism available in many-core GPUs has been of key interest for improving the accuracy of scientific and engineering simulations. In many cases, most of the simulation time is spent in a linear solver that relies on sparse matrix–vector multiplication. In forward petroleum oil and gas reservoir simulation, applying a stencil relationship to a structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness. We present a customized storage scheme that exploits the generalized hepta-diagonal sparsity pattern and the stencil regularity by optimizing both storage and matrix–vector computation. We also present an in-kernel optimization for implementing sparse matrix–vector multiply (SpMV) and the biconjugate gradient stabilized (BiCG-Stab) solver. The in-kernel approach avoids the multiple kernel invocations associated with the use of numerical library operators. To remain in-kernel, a lock-free inter-block synchronization is used in which thread blocks that finish early are assigned independent computations so that they do not repeatedly poll global memory. Other optimizations combine reductions and collective write operations to memory. The in-kernel optimization is particularly useful for the iterative structure of BiCG-Stab: it preserves vector data locality and avoids saving vector data back to memory and reloading it on each kernel exit and re-entry. The evaluation uses generalized hepta-diagonal matrices derived from a range of structured grids in forward reservoir simulation. Results show the benefit of the proposed generalized hepta-diagonal custom storage scheme over standard library formats such as compressed sparse row, hybrid sparse, and diagonal. Using the proposed optimizations, SpMV and BiCG-Stab are noticeably accelerated compared to implementations that exit and re-enter multiple kernels when the solver is built from numerical library operators.
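To make the storage idea concrete, here is a minimal CUDA sketch of a DIA-style SpMV for a hepta-diagonal matrix arising from a 7-point stencil on an nx × ny × nz grid, with the seven diagonals at fixed offsets 0, ±1, ±nx, and ±nx·ny. The kernel name and the per-diagonal arrays (dB, dS, dW, dC, dE, dN, dT) are illustrative assumptions, not the paper's actual data structure; the paper's custom scheme exploits stencil regularity beyond this plain diagonal layout.

```cuda
// Hypothetical sketch: one thread per matrix row. Because the seven
// offsets are fixed by the grid shape, no column-index arrays are
// needed, which is the main storage saving over CSR-like formats.
__global__ void spmv_hepta(int n, int nx, int nxy,
                           const double* __restrict__ dB,  // offset -nx*ny
                           const double* __restrict__ dS,  // offset -nx
                           const double* __restrict__ dW,  // offset -1
                           const double* __restrict__ dC,  // main diagonal
                           const double* __restrict__ dE,  // offset +1
                           const double* __restrict__ dN,  // offset +nx
                           const double* __restrict__ dT,  // offset +nx*ny
                           const double* __restrict__ x,   // input vector
                           double* __restrict__ y)         // output vector
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double s = dC[i] * x[i];
    if (i - 1   >= 0) s += dW[i] * x[i - 1];
    if (i + 1   <  n) s += dE[i] * x[i + 1];
    if (i - nx  >= 0) s += dS[i] * x[i - nx];
    if (i + nx  <  n) s += dN[i] * x[i + nx];
    if (i - nxy >= 0) s += dB[i] * x[i - nxy];
    if (i + nxy <  n) s += dT[i] * x[i + nxy];
    y[i] = s;
}
```

Consecutive threads read consecutive elements of each diagonal array, so the seven coefficient streams are fully coalesced; only the accesses to x at the ±nx and ±nx·ny offsets are strided.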
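The lock-free inter-block synchronization that keeps BiCG-Stab in a single kernel can be pictured with a generic sense-reversing global barrier, sketched below. This is a standard pattern, not the paper's exact scheme: the paper additionally assigns early-arriving blocks independent computations instead of letting them spin on global memory. The names g_count, g_generation, and grid_barrier are hypothetical, and the kernel must be launched with no more blocks than can be co-resident on the device, or the barrier deadlocks.

```cuda
// Generic sketch of a lock-free, sense-reversing grid-wide barrier.
__device__ unsigned int g_count = 0;                // blocks arrived so far
__device__ volatile unsigned int g_generation = 0;  // barrier generation

__device__ void grid_barrier(unsigned int numBlocks)
{
    __syncthreads();    // all threads of this block have arrived
    __threadfence();    // publish this block's global writes first
    if (threadIdx.x == 0) {
        unsigned int gen = g_generation;
        // atomicInc wraps the counter back to 0 at numBlocks-1, so the
        // last arriving block (old value == numBlocks-1) both resets the
        // counter and advances the generation to release the others.
        if (atomicInc(&g_count, numBlocks - 1) == numBlocks - 1) {
            g_generation = gen + 1;
        } else {
            while (g_generation == gen) { /* spin until released */ }
        }
    }
    __syncthreads();    // no thread proceeds until thread 0 is released
}
```

Inside an in-kernel BiCG-Stab iteration, a barrier like this would sit between the SpMV, the dot-product reductions, and the vector updates, replacing the kernel exits that would otherwise flush vector data out of registers and shared memory.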




Acknowledgements

This project was funded by the National Plan for Science, Technology, and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, through the Science & Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM), Kingdom of Saudi Arabia, award number 12-INF3008-04. The authors thank KFUPM for computing support.

Author information


Corresponding author

Correspondence to Mayez A. Al-Mouhamed.


About this article


Cite this article

Al-Mouhamed, M.A., Khan, A.H. SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU. J Supercomput 73, 3761–3795 (2017). https://doi.org/10.1007/s11227-017-1972-3

