Research on the conjugate gradient algorithm with a modified incomplete Cholesky preconditioner on GPU

https://doi.org/10.1016/j.jpdc.2013.10.002

Highlights

  • We present a parallel method of the forward/backward substitutions on the GPU.

  • A new kernel of SpMV with high optimization on the GPU is proposed.

  • We suggest a new kernel of inner product on the GPU.

  • An efficient MIC preconditioned conjugate gradient algorithm on the GPU is presented.

Abstract

In this study, we discover the parallelism of the forward/backward substitutions (FBS) for two cases and thus propose an efficient preconditioned conjugate gradient algorithm with the modified incomplete Cholesky preconditioner on the GPU (GPUMICPCGA). Our proposed GPUMICPCGA has the following distinct characteristics: (1) the vector operations are optimized by grouping several vector operations into single kernels, (2) a new inner-product kernel and a new, highly optimized sparse matrix–vector multiplication kernel are presented, and (3) an efficient parallel implementation of FBS on the GPU (GPUFBS) for the two cases is suggested. Numerical results show that our proposed kernels outperform the corresponding ones in CUBLAS and CUSPARSE, and that GPUFBS is almost 3 times faster than the implementation of FBS using the CUSPARSE library. Furthermore, GPUMICPCGA behaves better than its counterpart implemented with the CUBLAS and CUSPARSE libraries.

Introduction

Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications and, with their many-core architecture, offer a serious opportunity for scientific computing. Processing large data sets in parallel on GPUs has drawn much attention in recent years. Following the introduction by NVIDIA in 2007 of CUDA (Compute Unified Device Architecture) [16], a programming model designed to support joint CPU/GPU execution of applications, GPUs have become increasingly serious competitors among general-purpose parallel programming systems.

The conjugate gradient (CG) algorithm is one of the best-known iterative methods. With a suitable preconditioner, the performance of the CG algorithm can be dramatically improved. The preconditioned conjugate gradient (PCG) algorithm has proven its efficiency and robustness in a wide range of applications. Recently, many researchers have attempted to develop suitable and flexible PCG algorithms for the GPU architecture. In [3], an efficient implementation of a Jacobi preconditioned conjugate gradient algorithm on GPUs is developed. However, the Jacobi preconditioner has only a limited impact on the overall efficiency and robustness of the PCG algorithm. In [10], the authors give a parallel implementation of the SSOR (Symmetric Successive Over-Relaxation) PCG algorithm by utilizing a first-order approximation. Numerical results show that, compared to the CPU implementation of the CG algorithm, their GPU PCG implementation is up to 10 times faster. In [9], a new parallel computational technique is proposed for parallelizing an explicit inverse preconditioned biconjugate gradient type method on GPUs. Other related work can be found in [7], [13], [14].

From the studies mentioned above, it can be seen that for each PCG algorithm, one of the key issues when shifting it to the GPU platform is how to solve the equation Mz = r (where M is the preconditioning matrix) in parallel on the GPU. The modified incomplete Cholesky factorization (MIC) preconditioner has proven effective in improving the performance of the CG algorithm; in this case M = LDL^T, where L is a lower triangular matrix. However, solving the equation LDL^T z = r requires two steps: first, LDv = r is solved for v by forward substitution, and then DL^T z = Dv is solved for z by backward substitution. The forward/backward substitutions are clearly not easy to implement on GPUs, which to a great extent prevents the PCG algorithm with the MIC preconditioner from being shifted to the GPU platform.
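To make the data dependence concrete, the following CPU sketch performs the two substitution steps for M = LDL^T, with L stored in a CSR-like format that keeps only its strictly lower triangular part. The storage layout and all names are our illustrative assumptions, not the paper's implementation; the point is that each element needs earlier elements of the same sweep, which is what makes a naive GPU port serial.

```cpp
// Minimal CPU sketch of the two triangular solves needed for the MIC
// preconditioner M = L * D * L^T (unit lower-triangular L, diagonal D).
// L is stored in CSR format containing only its strictly lower part;
// the names (lRow, lCol, lVal, d, ...) are illustrative, not the paper's.
#include <vector>

void micSolve(const std::vector<int>&    lRow,   // CSR row pointers of strict lower L
              const std::vector<int>&    lCol,   // CSR column indices
              const std::vector<double>& lVal,   // CSR values
              const std::vector<double>& d,      // diagonal of D
              const std::vector<double>& r,      // right-hand side
              std::vector<double>&       z)      // solution of M z = r
{
    const int n = static_cast<int>(d.size());
    std::vector<double> v(n);

    // Forward substitution: solve (L D) v = r row by row.
    // Each v[i] needs v[j] for every j < i with L(i,j) != 0.
    for (int i = 0; i < n; ++i) {
        double s = r[i];
        for (int k = lRow[i]; k < lRow[i + 1]; ++k)
            s -= lVal[k] * d[lCol[k]] * v[lCol[k]];
        v[i] = s / d[i];
    }

    // Backward substitution: solve (D L^T) z = D v, which reduces to
    // z[i] = v[i] - sum_{j > i} L(j,i) * z[j].
    z = v;
    for (int j = n - 1; j >= 0; --j)
        for (int k = lRow[j]; k < lRow[j + 1]; ++k)
            z[lCol[k]] -= lVal[k] * z[j];
}
```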

Therefore, in this study, we discuss how to implement the MIC PCG algorithm on the GPU and, in particular, whether the forward/backward substitutions can be parallelized on the GPU. Fortunately, in our recent studies, we observe that because L may have the same non-zero pattern as the coefficient matrix of the linear system, the forward/backward substitutions can be parallelized well on the GPU if the coefficient matrix of the linear system satisfies some conditions.

Next, we take the following partial differential equation (PDE) as an example to present our recent findings and thus propose an efficient parallel implementation of the MIC PCG algorithm on the GPU: u_t = Δu in Ω × [0, T] (1), where Ω is a regular domain in R^d, d = 2, 3, Δ is the Laplace operator, and u is the unknown solution. A solution u of the PDE (1) must satisfy the boundary conditions (Dirichlet, Neumann or Cauchy) and the initial conditions.

For d = 2, we utilize the discrete variational derivative method (DVDM) to discretize the PDE (1). Readers interested in the DVDM are referred to [6]. Assuming that a 3×3 two-dimensional grid, shown in Fig. 1, is defined, the resulting scheme can be written as the following linear system for a certain time t ∈ [0, T]: A^(2) u^(2) = b^(2) (2), where

the coefficient matrix A^(2) is a sparse, symmetric, positive definite, five-diagonal, diagonally dominant 9×9 matrix with a regular structure, b^(2) is a vector of size 9, and u^(2) is the unknown solution vector of size 9.
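For illustration, the following sketch assembles a matrix with exactly this five-diagonal structure in diagonal (DIA) format for a general nx×ny grid. The stencil coefficient values depend on the DVDM scheme and the time step and are therefore left as hypothetical parameters (center, off); the struct and function names are also assumptions.

```cpp
// Sketch of the five-diagonal structure of A^(2) for an nx-by-ny grid in
// DIA format. The five diagonal offsets follow from the coupling of each
// node with its west/east and south/north neighbours; the coefficient
// values (center, off) are placeholders, not the paper's scheme.
#include <vector>

struct DiaMatrix {
    int n;                            // number of rows (nx * ny)
    std::vector<int> offsets;         // one offset per stored diagonal
    std::vector<double> vals;         // n values per diagonal, diagonal-major
};

DiaMatrix buildFiveDiagonal(int nx, int ny, double center, double off)
{
    DiaMatrix A;
    A.n = nx * ny;
    A.offsets = { -nx, -1, 0, 1, nx };            // south, west, main, east, north
    A.vals.assign(5 * A.n, 0.0);
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
            int p = j * nx + i;                   // row-major node index
            if (j > 0)      A.vals[0 * A.n + p] = off;    // south neighbour
            if (i > 0)      A.vals[1 * A.n + p] = off;    // west neighbour
                            A.vals[2 * A.n + p] = center; // main diagonal
            if (i < nx - 1) A.vals[3 * A.n + p] = off;    // east neighbour
            if (j < ny - 1) A.vals[4 * A.n + p] = off;    // north neighbour
        }
    return A;
}
```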

According to the definition of the MIC preconditioning, the preconditioning matrix of A^(2) is M^(2) = L^(2) D^(2) (L^(2))^T, where L^(2) is a lower triangular matrix with ones along its main diagonal and with non-zero off-diagonal entries only at locations where A^(2) has non-zero entries.

Generally, the forward and backward substitutions each need n steps (n being the number of elements of the solution vector) to compute all elements. However, it can be observed that for this case both substitutions can be finished in 5 steps instead of 9, following the progression shown in Fig. 2.

The computational grid in Fig. 2 is the same as in Fig. 1, and arrows represent dependencies between elements of the grid. From Fig. 2, we can see that the calculation of an element in each step depends on its two neighbors (the adjacent row and column elements), and these neighbors must be calculated before it. Also note that not all elements in Fig. 2 have both neighbors. For example, element 7 of the forward substitution depends only on its row neighbor, element 4. Moreover, the elements of each step are independent of each other and can be calculated concurrently. These observations also hold for the backward substitution, and all of them help in shifting the forward/backward substitutions to the GPU platform.
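This observation translates directly into a level-by-level (wavefront) implementation: all unknowns on the same anti-diagonal of the grid can be computed by one kernel launch. The CUDA sketch below assumes the five-diagonal case on an nx×ny grid, stores the two strictly lower off-diagonals of L in separate arrays, and omits the scaling by D for brevity; the kernel and all names are simplified assumptions, not the actual GPUFBS kernels.

```cuda
// Minimal CUDA sketch of a wavefront (level-by-level) forward substitution
// for the five-diagonal case: on an nx-by-ny grid, the unknown at (i, j)
// depends only on its west (i-1, j) and south (i, j-1) neighbours, so all
// unknowns with i + j == level can be solved concurrently.
// lw and ls hold the two strictly lower off-diagonals of L (unit diagonal
// assumed); these names are illustrative, not the paper's.
#include <algorithm>

__global__ void forwardLevelKernel(int level, int nx, int ny,
                                   const double* lw, const double* ls,
                                   const double* r, double* v)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // position along the anti-diagonal
    int j = t;                                      // grid coordinates with i + j == level
    int i = level - t;
    if (j >= ny || i < 0 || i >= nx) return;

    int p = j * nx + i;                             // row index, row-major node ordering
    double s = r[p];
    if (i > 0) s -= lw[p] * v[p - 1];               // west neighbour, solved at level - 1
    if (j > 0) s -= ls[p] * v[p - nx];              // south neighbour, solved at level - 1
    v[p] = s;                                       // unit diagonal of L assumed
}

// Host side: one kernel launch per level, nx + ny - 1 levels in total
// (5 launches instead of 9 sequential steps for the 3x3 grid of Fig. 1).
void gpuForwardSubstitution(int nx, int ny, const double* lw, const double* ls,
                            const double* r, double* v)
{
    for (int level = 0; level < nx + ny - 1; ++level) {
        int width = std::min(level, ny - 1) + 1;    // candidate nodes on this anti-diagonal
        forwardLevelKernel<<<(width + 127) / 128, 128>>>(level, nx, ny, lw, ls, r, v);
    }
}
```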

In the above case, we investigate a class of five-diagonal matrices such as A^(2) and show how to decrease the number of computational steps of the forward/backward substitutions and expose the parallelism within them. We now consider the three-dimensional PDE (d = 3) in order to investigate a class of seven-diagonal matrices.

Assuming that a 4×2×2 three-dimensional grid, shown in Fig. 3, is defined, the PDE (1) (d = 3) is discretized by the DVDM and then written as the following linear system for a certain time t ∈ [0, T]: A^(3) u^(3) = b^(3) (3), where the coefficient matrix A^(3) is a sparse, symmetric, positive definite, seven-diagonal, diagonally dominant 16×16 matrix with a regular structure, b^(3) is a vector of size 16, and u^(3) is the unknown solution vector of size 16. Similarly, the MIC preconditioning matrix M^(3) = L^(3) D^(3) (L^(3))^T can be obtained, where L^(3) is a lower triangular matrix with ones along its main diagonal and with non-zero off-diagonal entries only at locations where A^(3) has non-zero entries.

By analyzing the lower triangular matrix L^(3), we conclude that for this case the number of computational steps of both the forward and backward substitutions decreases from 16 to 6. In Fig. 4, we take the forward substitution as an example to show its computational progression.

Compared to the case in Fig. 2, here the calculation of an element in each step depends not only on its adjacent row and column neighbors but also on its adjacent layer neighbor. For example, element 14 depends on its adjacent row and column elements 10 and 13 and on its adjacent layer element 6. However, not all elements have all of these neighbors, and the number of neighbors varies with the location of the element. Moreover, for any element, the neighbors it depends on must be calculated before its own calculation can start. As in the case of Fig. 2, the elements of each step are independent of each other and can be calculated concurrently.
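In this seven-diagonal case, the step (level) of a node (i, j, k) is simply i + j + k, because each node depends only on its west, south, and lower-layer neighbors. The short sketch below groups nodes by level under an assumed flat ordering p = (k·ny + j)·nx + i; for the 4×2×2 grid it produces nx + ny + nz − 2 = 6 levels, matching the 6 steps of Fig. 4. Names and the ordering are our assumptions.

```cpp
// Minimal sketch of the level (wavefront) assignment for the 3D
// seven-diagonal case: node (i, j, k) depends on (i-1, j, k), (i, j-1, k)
// and (i, j, k-1), so its level is i + j + k, and nodes sharing a level
// can be solved concurrently. The flat ordering p = (k*ny + j)*nx + i
// is an illustrative assumption.
#include <vector>

std::vector<std::vector<int>> buildLevels(int nx, int ny, int nz)
{
    std::vector<std::vector<int>> levels(nx + ny + nz - 2);
    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i)
                levels[i + j + k].push_back((k * ny + j) * nx + i);
    return levels;                    // one list of independent nodes per step
}
```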

In the following sections, we further discuss the implementation of the MIC PCG algorithm on the GPU. Compared to other related work, our research makes the following main contributions.

  • For the MIC preconditioning, the parallelism of the forward/backward substitutions for two cases is found, and thus an efficient parallel method on the GPU is presented.

  • A new inner-product kernel and a new sparse matrix–vector multiplication (SpMV) kernel on the GPU are proposed. The proposed SpMV kernel is optimized through shared memory and fully coalesced global memory accesses (a simplified sketch is given after this list).

  • Based on the above methods, an optimized double-precision MIC PCG algorithm on the GPU is suggested.
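As a rough illustration of the SpMV contribution above, the following kernel performs a diagonal-format (DIA) SpMV with fully coalesced reads of the stored diagonals, matching the five- or seven-diagonal matrices considered here. The paper's actual kernel additionally exploits shared memory; this simplified version and its names are assumptions only.

```cuda
// Minimal sketch of a diagonal-format (DIA) SpMV kernel for a matrix with a
// fixed set of diagonals, such as the five-diagonal A^(2) or the
// seven-diagonal A^(3). Diagonals are stored diagonal-major so that
// consecutive threads read consecutive addresses (coalesced access).
// This is a simplified assumption, not the paper's optimized kernel.
__global__ void diaSpmv(int n, int ndiags,
                        const int* __restrict__ offsets,   // diagonal offsets
                        const double* __restrict__ vals,   // n values per diagonal
                        const double* __restrict__ x,
                        double* __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double sum = 0.0;
    for (int d = 0; d < ndiags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < n)
            sum += vals[d * n + row] * x[col];   // coalesced across the warp
    }
    y[row] = sum;
}
```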

The remainder of this paper is organized as follows. In the second section, the MIC PCG algorithm and the CUDA architecture are described. In the third section, the GPU kernels for the MIC PCG algorithm are presented. Numerical results are given in the fourth section. The fifth section contains our conclusions and points to future research directions.

Section snippets

MIC PCG algorithm

In this section, we discuss the MIC PCG algorithm for solving the linear systems (2) and (3). To simplify the discussion, the two linear systems are unified as the following linear system: Au = b, where A is a sparse, symmetric, positive definite five-diagonal or seven-diagonal matrix. In what follows, only when a method or operation differs for different A (A being A^(2) or A^(3)) do we discuss it separately for A^(2) and A^(3) in turn.
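For reference, the sketch below lays out the PCG iteration that GPUMICPCGA parallelizes, written as plain CPU code with the SpMV and the MIC forward/backward solves passed in as abstract operators. It is a minimal reference implementation under our own naming (applyA, applyM, pcg), not Algorithm 1 itself; in the GPU version the vector work runs in kernels and only the scalar bookkeeping stays on the CPU.

```cpp
// Minimal CPU reference sketch of the preconditioned CG iteration.
// applyA stands for the SpMV and applyM for the MIC forward/backward
// solves; both are passed in so the sketch stays storage-independent.
// All names are illustrative assumptions.
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;   // out = Op(in)

static double dot(const Vec& a, const Vec& b)
{
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void pcg(const Op& applyA, const Op& applyM, const Vec& b, Vec& u,
         int maxIter, double tol)
{
    const size_t n = b.size();
    Vec r(n), z(n), p(n), q(n);

    applyA(u, q);                                   // q = A u
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - q[i];
    applyM(r, z);                                   // z = M^{-1} r (GPUFBS on the GPU)
    p = z;
    double rho0 = dot(r, z);

    for (int k = 0; k < maxIter && std::sqrt(dot(r, r)) > tol; ++k) {
        applyA(p, q);                               // SpMV kernel
        double alpha = rho0 / dot(p, q);            // inner-product kernel
        for (size_t i = 0; i < n; ++i) {            // fused vector updates
            u[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        applyM(r, z);                               // forward/backward solves
        double rho1 = dot(r, z);
        double beta = rho1 / rho0;                  // scalar work kept on the CPU
        rho0 = rho1;
        for (size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
}
```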

For

GPU kernels and optimization

In this section, the primary kernels used in the implementation of Algorithm 1 are presented. Following Algorithm 1, the basic operations are sparse matrix–vector multiplications, vector operations, inner products of vectors, and the forward/backward substitutions. In the following subsections, we describe how these basic operations are computed.
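As an illustration of the vector and inner-product stages, the sketch below fuses two vector updates into a single launch and computes per-block partial sums of an inner product with a shared-memory tree reduction; a second pass, or a host-side sum over blockSums, finishes the dot product. Both kernels and their names are simplified assumptions, not the paper's kernels.

```cuda
// Two illustrative kernels: fusedUpdate groups u = u + alpha*p and
// r = r - alpha*q into one launch, and dotPartial produces per-block
// partial sums of (a, b) via a shared-memory tree reduction.
// Launch dotPartial with a power-of-two block size and
// blockDim.x * sizeof(double) bytes of dynamic shared memory.
// Names are assumptions, not the paper's implementation.
__global__ void fusedUpdate(int n, double alpha,
                            const double* p, const double* q,
                            double* u, double* r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        u[i] += alpha * p[i];   // one launch replaces two separate axpy calls
        r[i] -= alpha * q[i];
    }
}

__global__ void dotPartial(int n, const double* a, const double* b,
                           double* blockSums)
{
    extern __shared__ double cache[];
    int tid    = threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    double s = 0.0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        s += a[i] * b[i];                           // grid-stride loop, coalesced reads
    cache[tid] = s;
    __syncthreads();

    for (int half = blockDim.x / 2; half > 0; half /= 2) {   // tree reduction
        if (tid < half) cache[tid] += cache[tid + half];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = cache[0];
}
```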

Numerical results

In this section, we test our proposed MIC PCG algorithm on the GPU (GPUMICPCGA) from the following two aspects: (1) analyzing the performance of our proposed kernels; (2) comparing GPUMICPCGA with a mixed implementation of the MIC PCG algorithm using the CUBLAS and CUSPARSE libraries.

All experiments are conducted under GNU/Linux Ubuntu 10.04.1 on an Intel Xeon quad-core 2.66 GHz CPU with 12 GB RAM and an NVIDIA Tesla C2050 GPU (ECC enabled) with 448 CUDA cores and 6 GB RAM.

Conclusion

In this study, we discover a parallel method of the forward/backward substitutions on the GPU (GPUFBS) for two cases and thus propose an efficient parallel MIC PCG algorithm on the GPU (GPUMICPCGA). In our proposed GPUMICPCGA, most calculations, such as the forward/backward substitutions, the SpMV, the inner products of vectors, and the vector operations, are executed on the GPU, and the CPU is only used to execute simple operations such as β = ρ_1/ρ_0, ρ_0 = ρ_1, counting the number of

Acknowledgments

We gratefully acknowledge the comments from the anonymous reviewers, which greatly helped us to improve the contents of the paper.


References (21)

  • V. Galiano et al., GPU-based parallel algorithms for sparse nonlinear systems, J. Parallel Distrib. Comput. (2012)
  • R. Helfenstein et al., Parallel preconditioned conjugate gradient algorithm on GPU, J. Comput. Appl. Math. (2012)
  • M. Krotkiewski et al., Parallel symmetric sparse matrix–vector product on scalar multi-core CPUs, Parallel Comput. (2010)
  • N. Bell, M. Garland, Implementing sparse matrix–vector multiplication on throughput-oriented processors, in: Proc....
  • N. Bell, M. Garland, Efficient sparse matrix–vector multiplication on CUDA, Technical report, NVIDIA,...
  • L. Buatois et al., Concurrent number cruncher: a GPU implementation of a general sparse linear solver, Int. J. Parallel Emergent Distrib. Syst. (2009)
  • T. Davis, University of Florida sparse matrix collection, NA Digest, 1997, see...
  • X.W. Feng, H. Jin, et al., Optimization of sparse matrix–vector multiplication with variant CSR on GPUs, in: Proceedings...
  • D. Furihata et al., Discrete Variational Derivative Method: A Structure-Preserving Numerical Method for Partial Differential Equations (2011)
  • P. Georges et al., High-performance PCG solves for FEM structural analysis, Int. J. Numer. Methods Eng. (1996)
There are more references available in the full text version of this article.


Jiaquan Gao received the Ph.D. degree in Computer Science from the Institute of Software, Chinese Academy of Sciences in 2002. He is currently an associate professor of the College of Computer Science and Technology at the Zhejiang University of Technology in Hangzhou, China. He respectively worked as a visiting scholar at the McGill University, Canada, from September 2007 to September 2008 and the Georgia Institute of Technology, US, from December 2011 to May 2012. His current research interests include high-performance computing (HPC), parallel algorithms on heterogeneous platforms for solving linear/nonlinear systems, computational intelligence, and information visualization. He is a senior member of the Chinese Computer Federation and a member of the ACM.

Ronghua Liang received the master degree from Hangdian University in 1996, and the Ph.D. degree in computer science from Zhejiang University in 2003. He worked as a research fellow at the University of Bedfordshire, UK, from April 2004 to July 2005 and as a visiting scholar at the University of California, Davis, US, from March 2010 to March 2011. He is currently a Professor of Computer Science and Vice Dean of College of Information Engineering, Zhejiang University of Technology, China. His research interests include computer vision, information visualization, and medical visualization.

Jun Wang received the Ph.D. degree in computer science from McGill University, Canada, in 2009. He is currently a Researcher of the School of Electrical & Computer Engineering, College of Engineering at the Georgia Institute of Technology in the US. His current research interests include high-performance computing (HPC), computational intelligence, parallel and distributed simulation.

The research has been supported by the Chinese Natural Science Foundation under grant numbers 61379017, 61379076 and 61202049, and the Natural Science Foundation of Zhejiang Province, China under grant numbers LY12A01027 and LY13F020027, and Jiangsu Key lab for NSLSCS under grant number 201204.
