Abstract
Forward and back substitution algorithms are widely used for solving linear systems of equations after LU decomposition of the coefficient matrix. They are also essential in implementing high-performance preconditioners, which improve the convergence properties of iterative methods. In this paper, we describe an efficient approach to implementing forward and back substitution on a GPU and provide the implementation details of these algorithms for a Modified Incomplete Cholesky Preconditioner for the Conjugate Gradient (CG) algorithm. The resulting forward and back substitution algorithms are then used in a Modified Incomplete Cholesky Preconditioned Conjugate Gradient method to solve the sparse, symmetric, positive definite linear systems of equations arising from the discretization of three-dimensional finite difference ground-water flow models. By utilizing multiple threads, the proposed method yields speedups of up to 60 times on a GeForce GTX 280 compared to a CPU implementation, and of up to 4.8 times compared to the cuSPARSE library function optimized for the GPU by NVIDIA.
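As a point of reference for the two triangular solves named in the abstract, the following is a minimal sequential sketch in Python, not the GPU implementation described in the paper. It assumes a dense lower-triangular factor L stored as a list of lists and a preconditioner of the form M = L Lᵀ; the function names (`forward_substitution`, `back_substitution`, `apply_cholesky_preconditioner`) are hypothetical and chosen only for illustration.

```python
def forward_substitution(L, b):
    """Solve L y = b for a dense lower-triangular L (list of lists)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        # y[i] depends on all previously computed y[0..i-1].
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y for a dense upper-triangular U (list of lists)."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # x[i] depends on all previously computed x[i+1..n-1].
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


def transpose(A):
    """Return the transpose of a dense matrix stored as a list of lists."""
    return [list(row) for row in zip(*A)]


def apply_cholesky_preconditioner(L, r):
    """Solve (L L^T) z = r: forward solve with L, then back solve with L^T.

    Assumption: the preconditioner has the form M = L L^T with L dense;
    the paper's factor is sparse and its solves run on a GPU.
    """
    y = forward_substitution(L, r)
    return back_substitution(transpose(L), y)


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [0.0, 1.0, 4.0]]
    r = [8.0, 31.0, 57.0]
    print(apply_cholesky_preconditioner(L, r))
```

Each unknown depends on the ones computed before it, so the loops above are inherently sequential; that dependency chain is what makes an efficient GPU implementation of these solves nontrivial.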
Cite this article
Aksari, Y., Artuner, H. Forward and back substitution algorithms on GPU: a case study on modified incomplete Cholesky Preconditioner for three-dimensional finite difference method. J Supercomput 62, 550–572 (2012). https://doi.org/10.1007/s11227-011-0736-8
DOI: https://doi.org/10.1007/s11227-011-0736-8