1 Introduction

1.1 Motivation

Solving block-tridiagonal systems is one of the key issues in numerical simulations of many scientific and engineering problems. Block-tridiagonal equations arise from classical finite-difference discretizations of two-dimensional separable elliptic equations. For instance, the Fourier-transformed partial differential equations (PDEs) have block-tridiagonal structure when the underlying equations involve at most second-order derivatives for three-dimensional systems, where two coordinates are angular and one is radial. In particular, the optimal resource allocation schemes of a cloud-computing platform are found by solving block-tridiagonal equations generated by a queuing model.

The clock frequency of processors is difficult to increase significantly due to the limitations of current very large-scale integration (VLSI) technology. In response, multicore processors and/or specialized hardware accelerators have been designed and developed by most hardware manufacturers [1]. The parallel characteristics of these new architectures are exploited to improve the performance of programs. The emergence of GPUs as general-purpose computing platforms offers powerful parallel processing capabilities at low cost, and GPUs are widely used in the field of high-performance computing.

There are two classes of methods for solving block-tridiagonal equations: direct methods and iterative methods. Direct methods obtain the solution by elimination, during which some zero elements may become non-zero (fill-in). Furthermore, the error of the solution grows with the number of elimination operations. Iterative methods, such as the Jacobi method, the Gauss–Seidel method, GMRES, and the conjugate gradient method, are suitable for large sparse matrices and have no cumulative error. However, the stability of iterative methods is poor, and the iterations may converge slowly for some sparse matrices.

1.2 Our contributions

For block-tridiagonal systems of linear equations, we present a solving method that mixes direct and iterative methods. Our method generates a sequence of approximate solutions \(\{x^k\}\) in the same way as conventional iterative methods, where \(x^k\) denotes the approximate solution at the kth iteration. Conventional iterative methods essentially involve the matrix A only through matrix-vector multiplication and do not make full use of the tridiagonal structure of block-tridiagonal matrices, which results in slow convergence. For most block-tridiagonal matrices, the non-zero elements are mainly concentrated in the blocks on the main diagonal, while the blocks above and below the main diagonal have few non-zero elements. In our method, the submatrices on the main diagonal are solved by direct methods within the iteration process. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the block-tridiagonal system of linear equations is improved. Some direct methods perform well on small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm that solves the sub-equations by thread blocks on the GPU, with intermediate data stored in shared memory, which significantly reduces the latency of memory access. Therefore, the computational complexity of the hybrid method is not increased and the convergence speed is accelerated.

According to our experiments on ten test cases, the performance improvement achieved by our algorithm is substantial. Compared with the CG and BiCGSTAB solvers of the PARALUTION library, the average number of iterations is reduced by 283.15 and 18.34 %, respectively, and the performance of our method is better than that of the commonly used iterative and direct methods: the performance of solving the test cases on the GPU is improved by 26.98, 11.52, and 9.25 % using our method compared with CG, BiCGSTAB of the PARALUTION library, and cuSolverSP of CUDA, respectively.

The remainder of the paper is organized as follows. In Sect. 2, we review related research on solving block-tridiagonal systems of linear equations. In Sect. 3, we present an introduction to CUDA. In Sect. 4, we develop the solution method for block-tridiagonal matrices, and in Sect. 4.6, we describe the parallel implementation of our method on the GPU. In Sect. 5, we demonstrate the performance comparison results of our extensive experiments. In Sect. 6, we conclude the paper.

2 Related work

Block-tridiagonal matrices have a central diagonal and two adjacent diagonals located at a distance m from the center, where m is the block size. There has been considerable work on developing solution algorithms for block-tridiagonal matrices. For a block-tridiagonal matrix A, it is possible to obtain an exact inverse (direct solution) with no fill-in using the well-known serial Thomas algorithm [2], which is easily generalized to block sizes \(m>1\). While this is the fastest algorithm on a serial computer, it is not parallelizable, since each solution step of the algorithm depends on the preceding one. Many authors have considered efficient parallel block solvers for scalar block (\(m = 1\)) matrices based on cyclic reduction [3]. Cyclic reduction was first described by Heller [4] for block-tridiagonal systems, although an efficient parallel code was not described. The BCYCLIC code [5] fills a software gap in the available codes for solving tridiagonal systems with large (\(m>1\)), dense blocks. Reference [6] analyzed four known parallel tridiagonal system solvers: cyclic reduction [7, 8], recursive doubling [9], Bondeli's divide-and-conquer algorithm [10], and Wang's partition method [11]. Cyclic reduction and recursive doubling use a fine grain of parallelism, where each processor computes only one equation of the system. Reference [11] developed a partition method with a coarser grain of parallelism. ScaLAPACK [12] provides another method for efficiently solving dense block-tridiagonal systems. Block factorization and solution based on ScaLAPACK are currently implemented in the SIESTA [13] MHD equilibrium code. This technique scales well with processor count only for very large matrix block sizes. For matrix blocks of interest for small 3D MHD problems (\(m=300\)), scalability was found to be limited to about 5–10 processors.

Some works [14–24] presented how hardware and software can work in concert on scalable multiprocessor systems, with a number of illustrative matrix-based examples and applications, and provided a historical perspective and relevant context for numerical computation on CPU–GPU heterogeneous computing systems. Other works [25–33] discussed how processor technologies can help scientific computing applications that involve matrix operations. Two-level parallelization [34] was introduced to solve a massive block-tridiagonal matrix system. One level is used for distributing blocks, whose size is as large as the number of block rows due to the spectral basis, and the other level is used for parallelizing in the block row dimension.

Some researchers have investigated parallel algorithms for solving systems of linear algebraic equations with block-tridiagonal matrices using iterative methods. In [35, 36], the Krylov method [37] is used, and special preconditioning procedures, which account for the structure of a block-tridiagonal matrix, are used to increase the rate of convergence. The efficiency of the algorithms described above depends to a great extent on the size of the matrix blocks. As m increases, the local calculations become a greater part of the computation, and consequently, the efficiency of the method increases. This determines the domain of applicability of presently available parallel algorithms. An important case in practice is when the matrix block size is not large, i.e., does not exceed several hundred. For such a matrix, one can effectively apply a cost-effective variant of the Gaussian method with overheads of O(\(m^3n\)) [38]. For a large cloud-computing platform, there are tens of thousands of computing resources, each with several or dozens of execution states, and the state matrix generated by the state transitions is a block-tridiagonal matrix. Therefore, for this case, it is necessary to develop a parallel algorithm with comparable overheads that also enables the effective use of thousands of processors.

There are two common methods for solving block-tridiagonal equations: using a block-tridiagonal factorization, or treating the matrix as a band matrix [39]. Preconditioners are used to improve the performance and robustness of solving block-tridiagonal equations. For a symmetric positive definite block-tridiagonal matrix A, using the Cholesky decomposition can reduce the operation count [39, 40]. Ruggiero and Galligani [41] considered a new form of the arithmetic mean method for solving large block-tridiagonal linear systems and proposed a parallel implementation using the iterative method and the preconditioner on a multi-vector computer. Reference [42] presented an efficient blockwise update scheme for the QR decomposition of block-tridiagonal and block Hessenberg matrices, which arise in generalizations of Krylov space solvers to block methods for linear systems with multiple right-hand sides, and which can improve the performance and stability of block Krylov space solvers. Incomplete LU factorization is a common way to construct preconditioners for sparse linear systems and is known to give excellent results for this class of problems. Some theory for block ILU preconditioners is discussed in [37, 43]. This type of approach needs to approximate the inverses of the pivot blocks.

Work on SpMV [44–46], LU factorization [47], and general hybridized linear algebra routines [48], including tridiagonal solvers on GPUs, has been studied and optimized in the recent literature [49–52]. For NVIDIA GPUs, NVIDIA provides CUDA (compute unified device architecture) to improve the development efficiency of parallel programs [53]. The cuSolver library [54] provided with the CUDA tools includes implementations of direct solving algorithms based on the cuBLAS and cuSPARSE libraries [55]. PARALUTION [56] is a sparse linear algebra library focused on exploiting fine-grained parallelism, targeting modern processors and accelerators, including multi/many-core CPU and GPU platforms, and provides a portable library of iterative sparse methods for state-of-the-art hardware. Reference [57] presented a parallel algorithm for the generalized minimal residual iterative method using the CUDA programming language and the MPI parallel environment. The GREMLINS code [58] has been developed for solving large sparse linear systems on distributed grids. Reference [59] analyzed the performance of the GREMLINS code obtained with several libraries for solving the linear subsystems and compared it with that of the widely used PETSc library [60], which enables one to develop portable parallel applications.

Conventional iterative methods essentially involve the matrix only through matrix-vector multiplication and do not make full use of the tridiagonal structure of block-tridiagonal matrices, which results in slow convergence. Furthermore, the use of shared memory has a large effect on solving performance on the GPU. Conventional GPU-based methods store the whole matrix in global memory, which leads to larger data access latency. In contrast, we present an improved algorithm that solves the sub-equations by thread blocks on the GPU, with intermediate data stored in shared memory, which significantly reduces the latency of memory access and improves solving performance. Although PARALUTION provides some iterative and direct solving algorithms on the GPU, no hybrid iterative-direct solving algorithm for block-tridiagonal equations is provided. We present a hybrid solving method that has faster convergence and higher parallel efficiency.

3 An introduction of CUDA

GPUs are widely used in parallel computing, because more and more computing systems contain many-core devices and co-processors to speed up computation, and CUDA improves the development efficiency of parallel programs. The CUDA heterogeneous programming model is shown in Fig. 1. It assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the main program. This is the case, for example, when the kernels execute on a GPU and the rest of the main program executes on a CPU. The number of threads to be executed is decided by the programmer. A collection of threads (called a block) runs on a multiprocessor at a given time. Multiple blocks can be assigned to a single multiprocessor and their execution is time-shared. For heterogeneous computing systems with CPUs and GPUs, each CPU core can independently execute its own instructions and data, which is the MIMD model. If there is no dependency, no synchronization is needed between threads on different CPU cores. For multicore CPUs, each core can be scheduled independently to execute threads; therefore, partitioning methods for multicore CPUs do not need to consider the uniformity of the NNZ of the rows in a block for SpMV. However, the basic computing unit of a GPU is the streaming multiprocessor (SM). As a component at the bottom of the independent hardware structure, an SM can be seen as an SIMD processing unit. Each SM contains some scalar processors (SP) and special function units (SFU) [53], and synchronization is needed between the SPs of the same SM [53]. Therefore, load imbalance between threads has a larger impact on performance.
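As a minimal illustration of this host/device split (the kernel and array names below are ours, not taken from any cited code), a CUDA program launches a grid of thread blocks on the device, and each thread processes one element:

```cuda
// Minimal sketch of the CUDA heterogeneous model: the host allocates data,
// launches a grid of thread blocks on the GPU coprocessor, and waits for it.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAddKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                       // one block = one group of threads on an SM
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(a, b, c, N);
    cudaDeviceSynchronize();                         // host waits for the device

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```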

Fig. 1 CUDA heterogeneous programming model

4 The solving method for block-tridiagonal matrices

4.1 The compressed storage format

Sparse matrices from scientific and engineering applications may have various sparsity characteristics. Each sparse matrix should be stored in the compressed storage format most appropriate to its sparsity distribution.

The 6-by-6 sparse matrix A shown below is used as a running example in this section:

$$\begin{aligned} A= \left( \begin{array}{llllll} 6 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 9 &{}\quad 7 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 5 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 0 &{}\quad 8 &{}\quad 7 &{}\quad 3 \\ 0 &{}\quad 0 &{}\quad 6 &{}\quad 2 &{}\quad 11 &{}\quad 4 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 6 &{}\quad 21 \end{array} \right) . \end{aligned}$$

4.1.1 COO storage format

The coordinate (COO) format is a particularly simple storage scheme with a triplet (\(\mathbf {row}\), \(\mathbf {column}\), \(\mathbf {value}\)). The arrays \(\mathbf {row}\), \(\mathbf {column}\), and \(\mathbf {value}\) store the row indices, column indices, and values of the non-zero elements in a matrix, respectively. For the example sparse matrix A, we have

$$\begin{aligned}&\mathbf{row} = ( 0 ,0 ,1 ,1, 1, 2, 2,2,3,3,3,3,4,4,4,4,5,5),\\&\mathbf{column} = ( 0,2 ,0 ,1 ,2 ,1 ,2,3 ,1, 3, 4 , 5, 2, 3, 4, 5, 4, 5 ),\\&\mathbf{value} = ( 6,2,1,9,7,5,4,3,1,8,7,3,6,2,11,4,6,21 ). \end{aligned}$$
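These triplets can be held directly in three host arrays; the short sketch below (illustrative, compilable with nvcc or any C++ compiler) also shows a serial SpMV \(y=Ax\) that simply walks the triplets:

```cuda
// COO arrays of the example matrix A, plus a serial COO SpMV (y = A*x).
#include <cstdio>

int main() {
    const int nnz = 18, n = 6;
    int    row[nnz]    = {0,0,1,1,1,2,2,2,3,3,3,3,4,4,4,4,5,5};
    int    column[nnz] = {0,2,0,1,2,1,2,3,1,3,4,5,2,3,4,5,4,5};
    double value[nnz]  = {6,2,1,9,7,5,4,3,1,8,7,3,6,2,11,4,6,21};

    double x[n] = {1,1,1,1,1,1}, y[n] = {0};
    for (int t = 0; t < nnz; ++t)            // one triplet per non-zero element
        y[row[t]] += value[t] * x[column[t]];

    for (int i = 0; i < n; ++i) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```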

4.1.2 CSR storage format

The compressed sparse row (CSR) format is a popular and general-purpose sparse matrix representation scheme. CSR explicitly stores column indices and non-zero values in arrays \({\mathbf {Aj}}\) and \({\mathbf {Av}}\). The third array \({\mathbf {Ap}}\) represents the starting position of each row in the array \({\mathbf {Aj}}\). For an n-by-m matrix, \({\mathbf {Ap}}\) has length \(n+1\) and stores the offset of the ith row in \({\mathbf {Ap}}[i]\). The value of the last element is \({ NNZ}\), which is the number of non-zero elements. For the example sparse matrix A, we have

$$\begin{aligned}&\mathbf {Ap} = ( 0 ,2 ,5 ,8, 12, 16, 18),\\&\mathbf {Aj} = ( 0,2 ,0 ,1 ,2 ,1 ,2,3 ,1, 3, 4 , 5, 2, 3, 4, 5, 4, 5 ),\\&\mathbf {Av} = ( 6,2,1,9,7,5,4,3,1,8,7,3,6,2,11,4,6,21 ). \end{aligned}$$
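A CSR SpMV sketch over the same example follows; because \(\mathbf {Ap}\) delimits each row, the rows can be processed independently, which is what makes CSR convenient for row-parallel kernels:

```cuda
// CSR SpMV over the example matrix: Ap gives the [start, end) range of each
// row inside Aj/Av, so every row can be handled independently of the others.
#include <cstdio>

int main() {
    const int n = 6, nnz = 18;
    int    Ap[n + 1] = {0,2,5,8,12,16,18};
    int    Aj[nnz]   = {0,2,0,1,2,1,2,3,1,3,4,5,2,3,4,5,4,5};
    double Av[nnz]   = {6,2,1,9,7,5,4,3,1,8,7,3,6,2,11,4,6,21};

    double x[n] = {1,1,1,1,1,1}, y[n];
    for (int i = 0; i < n; ++i) {            // rows are independent -> easy to parallelize
        double sum = 0.0;
        for (int t = Ap[i]; t < Ap[i + 1]; ++t)
            sum += Av[t] * x[Aj[t]];
        y[i] = sum;
    }
    for (int i = 0; i < n; ++i) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```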

4.1.3 DIA storage format

For an n-order sparse matrix A with k non-zero diagonals, storing only the elements on these diagonals can greatly reduce the storage overhead. The DIA format has two parts. The first part is an \(n \times k\) matrix \(\mathbf {D}\) that stores the values on the diagonals. The second part is an array \({\mathbf {offset}}\) of k elements, which are the offsets of the diagonals with respect to the main diagonal. For the sparse matrix A, we have

$$\begin{aligned} \mathbf {D} = \left( \begin{array}{lllll} * &{}\quad * &{}\quad 6 &{}\quad 0 &{}\quad 2 \\ * &{}\quad 1 &{}\quad 9 &{}\quad 7 &{}\quad 0 \\ 0 &{}\quad 5 &{}\quad 4 &{}\quad 3 &{}\quad 0 \\ 1 &{}\quad 0 &{}\quad 8 &{}\quad 7 &{}\quad 3 \\ 6 &{}\quad 2 &{}\quad 11 &{}\quad 4 &{}\quad * \\ 0 &{}\quad 6 &{}\quad 21 &{}\quad * &{}\quad * \end{array} \right) , {\mathbf {Offset}} = \left( -2 ,-1 , 0 , 1 , 2 \right) . \end{aligned}$$

DIA is a perfect format for the compression and storage of diagonal matrices, but it is not suitable for matrices with scattered non-zero elements [44]. The main reason is that the presence of non-zero elements on most of the diagonals leads to too many columns in the matrix \(\mathbf {D}\). In the extreme case of a matrix with non-zero elements on all diagonals, the data matrix \(\mathbf {D}\) involves \(2n-1\) columns and requires about twice the storage volume of the original dense matrix.
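A DIA SpMV sketch for the example follows (using the \(\mathbf {D}\) and \(\mathbf {Offset}\) above, with the padded \(*\) entries stored as zeros; the variable names are illustrative):

```cuda
// DIA SpMV sketch: D is stored row-major as n x k (k = number of stored
// diagonals) and Offset[d] is the column shift of diagonal d; padded entries
// are kept as zeros so they do not contribute to the product.
#include <cstdio>

int main() {
    const int n = 6, k = 5;
    int Offset[k] = {-2, -1, 0, 1, 2};
    double D[n][k] = {
        {0, 0,  6, 0, 2},
        {0, 1,  9, 7, 0},
        {0, 5,  4, 3, 0},
        {1, 0,  8, 7, 3},
        {6, 2, 11, 4, 0},
        {0, 6, 21, 0, 0}};

    double x[n] = {1,1,1,1,1,1}, y[n];
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int d = 0; d < k; ++d) {
            int j = i + Offset[d];           // column hit by diagonal d in row i
            if (j >= 0 && j < n) sum += D[i][d] * x[j];
        }
        y[i] = sum;
    }
    for (int i = 0; i < n; ++i) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```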

4.2 The block-tridiagonal matrix

The block-tridiagonal matrix A is shown as follows:

$$\begin{aligned} A= \left( \begin{array}{lllllll} D_1 &{}\quad U_1 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ L_2 &{}\quad D_2 &{}\quad U_2 &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad L_3 &{}\quad D_3 &{}\quad U_3 &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots &{}\quad \\ &{}\quad &{}\quad &{}\quad &{} L_{m-1} &{} D_{m-1} &{} U_{m-1}\\ &{}\quad &{}\quad &{}\quad &{}\quad &{} L_m &{} D_m \\ \end{array} \right) , \end{aligned}$$

where \(D_1,D_2,\ldots ,D_m\), \(L_2,L_3,\ldots ,L_m\), and \(U_1,U_2,\ldots ,U_{m-1}\) are \(k \times k\) square blocks, and the other positions in A are zero. Assume that the number of rows of A is n, where \(n = m \times k\).

4.3 Partitioning of block-tridiagonal matrix

The block-tridiagonal matrix A is partitioned into \(3m-2\) square blocks, namely \(D_1,D_2,\ldots ,D_m\), \(L_2,L_3,\ldots ,L_m\), and \(U_1,U_2,\ldots ,U_{m-1}\). In the block-tridiagonal matrices that arise from most scientific computing applications, such as finite differences and resource scheduling, \(L_2,L_3,\ldots ,L_m\) and \(U_1,U_2,\ldots ,U_{m-1}\) are very sparse, and most non-zero elements are in \(D_1,D_2,\ldots ,D_m\). Therefore, \(L_2,L_3,\ldots ,L_m\) and \(U_1,U_2,\ldots ,U_{m-1}\) are stored in compressed formats, such as the COO or CSR format. \(D_1,D_2,\ldots ,D_m\) are stored in suitable storage formats according to the distribution of their non-zero elements: they are stored as dense matrices if the non-zero elements are dense, and in the DIA format if they are tridiagonal matrices. In the block-tridiagonal matrices arising from resource scheduling, the diagonal blocks \(D_1,D_2,\ldots ,D_m\) are tridiagonal matrices.
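A possible host-side container for this partitioning is sketched below (the type and field names are illustrative assumptions, not the authors' data structures): each \(D_j\) keeps only its three diagonals, while each \(L_j\) and \(U_j\) is held as a COO block.

```cuda
// Sketch of a host-side container for the partitioned block-tridiagonal matrix.
#include <vector>

struct CooBlock {                    // one L_j or U_j block in COO form
    std::vector<int>    row, col;
    std::vector<double> val;
};

struct DiaTridiagBlock {             // one k x k tridiagonal block D_j, fixed offsets -1,0,1
    int k;
    std::vector<double> lower;       // D_j(i, i-1), i = 1..k-1
    std::vector<double> main;        // D_j(i, i),   i = 0..k-1
    std::vector<double> upper;       // D_j(i, i+1), i = 0..k-2
};

struct BlockTridiagMatrix {
    int m;                           // number of block rows
    int k;                           // block size
    std::vector<DiaTridiagBlock> D;  // D_1 .. D_m
    std::vector<CooBlock>        L;  // L_2 .. L_m
    std::vector<CooBlock>        U;  // U_1 .. U_{m-1}
};

int main() {
    BlockTridiagMatrix A;            // to be filled by the application code
    A.m = 0; A.k = 0;
    return 0;
}
```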

4.4 Hybrid blocked iterative solving algorithm (HBISA)

The block-tridiagonal equation is given by Eq. (1):

$$\begin{aligned} Ax=b. \end{aligned}$$
(1)

Equation (1) can be given by the following:

$$\begin{aligned} \left( \begin{array}{lllllll} D_1 &{}\quad U_1 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ L_2 &{}\quad D_2 &{}\quad U_2 &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad L_3 &{}\quad D_3 &{}\quad U_3 &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots &{}\quad \\ &{}\quad &{}\quad &{}\quad &{} L_{m-1} &{} D_{m-1} &{} U_{m-1}\\ &{}\quad &{}\quad &{}\quad &{}\quad &{} L_m &{} D_m \\ \end{array} \right) \times \left( \begin{array}{c} x_1\\ x_2\\ x_3\\ \vdots \\ \vdots \\ x_{m-1}\\ x_m\\ \end{array} \right) =\left( \begin{array}{c} b_1\\ b_2\\ b_3\\ \vdots \\ \vdots \\ b_{m-1}\\ b_m\\ \end{array} \right) , \end{aligned}$$

where \(x_j\) and \(b_j\) for \(j=1,2,\ldots ,m\) are the sub-vectors of x and b: \(x_j\) consists of the \(((j-1) \times k+1)\)th through \((j \times k)\)th elements of x, and \(b_j\) consists of the \(((j-1) \times k+1)\)th through \((j \times k)\)th elements of b.

Assume that \(A=L+D+U\), where

$$\begin{aligned} L= & {} \left( \begin{array}{lllllll} &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ L_2 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad L_3 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad \ddots &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad \ddots &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad L_{m-1} &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad L_m &{}\quad \\ \end{array} \right) ,\\ D= & {} \left( \begin{array}{lllllll} D_1 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad D_2 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad D_3 &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad \ddots &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad \ddots &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad D_{m-1} &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad D_m \\ \end{array} \right) , \end{aligned}$$

and

$$\begin{aligned} U= \left( \begin{array}{lllllll} &{}\quad U_1 &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad U_2 &{}\quad &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad U_3 &{}\quad &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad \ddots &{}\quad &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \ddots &{}\quad \\ &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad U_{m-1}\\ &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad &{}\quad \\ \end{array} \right) . \end{aligned}$$

Therefore, we have

$$\begin{aligned} (L+D+U)x=b, \end{aligned}$$
(2)

and

$$\begin{aligned} Dx=b-Lx-Ux. \end{aligned}$$
(3)

Letting

$$\begin{aligned} r=b-Lx-Ux. \end{aligned}$$

Then

$$\begin{aligned} Dx=r. \end{aligned}$$
(4)

Solving Eq. (4) is much easier than solving Eq. (1). Define the iterative scheme:

$$\begin{aligned} Dx^{i+1}=b-Lx^i-Ux^i,\;\; x^0=0. \end{aligned}$$
(5)

For each block row of A, Eq. (5) can be expressed as

$$\begin{aligned} D_jx_j^{i+1}=\left\{ \begin{array}{ll} b_j-U_jx_{j+1}^i &{}\quad j=1;\\ b_j-L_jx_{j-1}^i-U_jx_{j+1}^i &{}\quad 1<j<m;\\ b_j-L_jx_{j-1}^i &{}\quad j=m.\\ \end{array} \right. \end{aligned}$$
(6)
Algorithm 1 Hybrid blocked iterative solving algorithm (HBISA)

In Algorithm 1, the index i denotes the ith iteration of the solving process, and the index j denotes the jth block row of the block-tridiagonal matrix. \(r_j\) is the right-hand vector of \(D_jx_j^i=r_j\) and can be stored in an array and reused in all iterations. The vectors \(r_j\) are independent for different values of j, so there is no data dependency among them. Therefore, lines 5–17 can be processed in parallel; a serial sketch of this iteration is given below.
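Since the listing of Algorithm 1 is reproduced only as a figure, the following is a minimal serial sketch of the iteration in Eq. (6) under the storage assumptions of Sect. 4.3 (each \(D_j\) tridiagonal, each \(L_j\)/\(U_j\) in COO form); the function names, the Thomas solve for \(D_j\), and the stopping test on \(||x^{i+1}-x^i||_\infty \) are illustrative choices, not necessarily identical to the authors' listing.

```cuda
// Serial sketch of the HBISA iteration of Eq. (6); assumes block size k >= 2.
#include <vector>
#include <cmath>
#include <algorithm>

struct CooBlock { std::vector<int> row, col; std::vector<double> val; };
struct TriBlock { std::vector<double> lo, di, up; };       // diagonals of D_j

// r -= B * x for a k x k COO block B
static void cooSpmvSub(const CooBlock &B, const std::vector<double> &x,
                       std::vector<double> &r) {
    for (std::size_t t = 0; t < B.val.size(); ++t)
        r[B.row[t]] -= B.val[t] * x[B.col[t]];
}

// Direct (Thomas) solve of D x = r for one tridiagonal block
static void thomasSolve(const TriBlock &D, std::vector<double> r,
                        std::vector<double> &x) {
    const int k = static_cast<int>(D.di.size());
    std::vector<double> c(k);
    c[0] = D.up[0] / D.di[0];
    r[0] = r[0] / D.di[0];
    for (int i = 1; i < k; ++i) {
        const double w = D.di[i] - D.lo[i - 1] * c[i - 1];
        if (i < k - 1) c[i] = D.up[i] / w;
        r[i] = (r[i] - D.lo[i - 1] * r[i - 1]) / w;
    }
    x[k - 1] = r[k - 1];
    for (int i = k - 2; i >= 0; --i) x[i] = r[i] - c[i] * x[i + 1];
}

// HBISA-style iteration: update x block by block via Eq. (6) until the maximum
// change drops below tol. L holds L_2..L_m and U holds U_1..U_{m-1}, 0-indexed.
void hbisaSolve(const std::vector<TriBlock> &D, const std::vector<CooBlock> &L,
                const std::vector<CooBlock> &U,
                const std::vector<std::vector<double>> &b,
                std::vector<std::vector<double>> &x, double tol, int maxIter) {
    const int m = static_cast<int>(D.size());
    std::vector<std::vector<double>> xNew = x;
    for (int it = 0; it < maxIter; ++it) {
        double beta = 0.0;
        for (int j = 0; j < m; ++j) {                          // block rows are independent
            std::vector<double> r = b[j];
            if (j > 0)     cooSpmvSub(L[j - 1], x[j - 1], r);  // r_j -= L_j * x_{j-1}
            if (j < m - 1) cooSpmvSub(U[j],     x[j + 1], r);  // r_j -= U_j * x_{j+1}
            thomasSolve(D[j], r, xNew[j]);                     // direct solve D_j x_j = r_j
            for (std::size_t t = 0; t < r.size(); ++t)
                beta = std::max(beta, std::fabs(xNew[j][t] - x[j][t]));
        }
        x.swap(xNew);
        if (beta < tol) break;
    }
}
```

The sketch uses the previous iterate \(x^i_{j-1}\), as in Eq. (6); the sequential variant of Eq. (7) below would use \(x^{i+1}_{j-1}\) instead.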

For \(j>1\), \(x_{j-1}^{i+1}\) has been computed in the current iteration, and Eq. (6) can be replaced by Eq. (7):

$$\begin{aligned} D_jx_j^{i+1}=\left\{ \begin{array}{ll} b_j-U_jx_{j+1}^i &{}\quad j=1;\\ b_j-L_jx_{j-1}^{i+1}-U_jx_{j+1}^i &{}\quad 1<j<m;\\ b_j-L_jx_{j-1}^{i+1} &{}\quad j=m.\\ \end{array} \right. \end{aligned}$$
(7)

4.5 The iterative convergence analysis

Lemma 1

Assume that \(Ax=b\) with \(A=D+L+U\), where A is a nonsingular block-tridiagonal matrix and \(D,\,L,\,U\) are the submatrices of the splitting in Eq. (2). A necessary and sufficient condition for the convergence of Algorithm 1 applied to \(Ax=b\) is \(\rho (D^{-1}(L+U)) <1\), where \(\rho (D^{-1}(L+U))\) denotes the spectral radius of \(D^{-1}(L+U)\).

Proof

(Sufficiency): Since A is a nonsingular block-tridiagonal matrix, D is nonsingular. Therefore, Eq. (5) can be transformed into the following equation:

$$\begin{aligned} x^{i+1}=D^{-1}(b-Lx^i-Ux^i)=-D^{-1}(L+U)x^i+D^{-1}b. \end{aligned}$$

Let \(B=-D^{-1}(L+U)\) and \(f=D^{-1}b\); then the above iteration can be written as \(x^{i+1}= Bx^i+ f\). Since A is nonsingular, \(Ax=b\) has a unique solution; denote it by \(x^*\), so that \(x^*=Bx^* + f\). Define the error vector \(\epsilon ^k=x^k-x^*=B^k\epsilon ^0\), where \(\epsilon ^0=x^0 - x^*\) and \(B^k\) denotes the matrix B raised to the power k. Since \(\rho (D^{-1}(L+U)) <1\) and \(\rho (B)=\rho (-D^{-1}(L+U))=\rho (D^{-1}(L+U))\), we have \(\rho (B)<1\). \(\rho (B)\) is the maximum of the absolute values of the eigenvalues of B, and \(\rho (B)<1\) if and only if \(\lim \nolimits _{k\rightarrow \infty }B^k=0\) [61]. Therefore, \(\lim \nolimits _{k\rightarrow \infty }\epsilon ^k=0\) for any \(x^0\), and \(\lim \nolimits _{k\rightarrow \infty }x^k=x^*\).

(Necessity): Assume that \(\lim \nolimits _{k\rightarrow \infty }x^k=x^*\) for any \(x^0\), where \(x^{k+1}=Bx^k+f\). Obviously, \(x^*\) is the solution of \(x=Bx+f\), and for any \(x^0\), \(\epsilon ^k=x^k-x^*=B^k\epsilon ^0 \rightarrow 0 \,(k \rightarrow \infty )\). Therefore, \(\lim \nolimits _{k\rightarrow \infty }B^k=0\), and \(\rho (B)<1\). Since \(\rho (D^{-1}(L+U))=\rho (-D^{-1}(L+U))=\rho (B)\), it follows that \(\rho (D^{-1}(L+U)) <1\). \(\square \)

4.6 Parallel implementation

The iterative solving algorithm for the block-tridiagonal equation can be parallelized. The code in lines 5–17 can be executed in parallel, and each block row j can be processed as an independent task. If Algorithm 1 is processed in parallel, Eq. (6) should be used in lines 9, 12, and 14, because \(x_{j-1}^{i+1}\) is computed concurrently and cannot be obtained within the same iteration. \(L_j\), \(D_j\), and \(U_j\) are assigned to processors as standalone tasks, and lines 7–16 of Algorithm 1 are computed in parallel. The residual \(\beta \) of the iterative solution can be calculated piecewise in parallel: line 18 can be replaced by \(\beta ^i_j \leftarrow ||x^i_j-x^{i-1}_j||\) and moved into the loop of lines 5–17. The task allocation is shown in Fig. 2.

Fig. 2 Tasks allocation for solving the block-tridiagonal equation using Algorithm 1 in parallel

In CUDA, threads are allocated and scheduled in blocks. Furthermore, the number of cores in a GPU is far greater than that of a CPU, and the SpMV operations in lines 12 and 17 of Algorithm 1 are computed in parallel on the GPU. Common parallel algorithms, such as cyclic reduction (CR) and the partition method (PM), can be used to solve \(D_jx_j^i=r_j\); due to the small size of \(D_j\), direct solving algorithms should also perform well. SpMV is well suited to parallel computation on the GPU using CUDA: each row of \(L_j\) and \(U_j\) is assigned to a thread of the block to execute the multiplication with the corresponding sub-vector of the approximate solution (Fig. 3).
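A hedged CUDA sketch of this mapping is given below (not the authors' kernel): one thread block handles block row j and one thread handles one row of that block, computing \(r_j=b_j-L_jx_{j-1}^i-U_jx_{j+1}^i\) from Eq. (6). It assumes the CSR data of \(L_2,\ldots ,L_m\) and \(U_1,\ldots ,U_{m-1}\) are concatenated block by block; all names are illustrative.

```cuda
// One thread block per block row j, one thread per row of that block.
// Block q (0-based) of L (resp. U) owns the k+1 row offsets
// Lp[q*(k+1) .. q*(k+1)+k], which are global positions into Lv (resp. Uv).
__global__ void blockResidualKernel(int m, int k,
                                    const int *Lp, const int *Lj, const double *Lv,
                                    const int *Up, const int *Uj, const double *Uv,
                                    const double *b, const double *x, double *r)
{
    int j   = blockIdx.x;                 // block row index, 0..m-1
    int row = threadIdx.x;                // local row index, 0..k-1
    if (j >= m || row >= k) return;

    int g = j * k + row;                  // global row index
    double sum = b[g];

    if (j > 0) {                          // subtract (L_j * x_{j-1})[row]
        int base = (j - 1) * (k + 1);     // row-pointer block of the lower block
        for (int t = Lp[base + row]; t < Lp[base + row + 1]; ++t)
            sum -= Lv[t] * x[(j - 1) * k + Lj[t]];
    }
    if (j < m - 1) {                      // subtract (U_j * x_{j+1})[row]
        int base = j * (k + 1);           // row-pointer block of the upper block
        for (int t = Up[base + row]; t < Up[base + row + 1]; ++t)
            sum -= Uv[t] * x[(j + 1) * k + Uj[t]];
    }
    r[g] = sum;                           // r_j then feeds the direct solve of D_j x_j = r_j
}
```

A launch such as blockResidualKernel<<<m, k>>>(...) then assigns one block row per thread block, matching the task mapping of Fig. 3.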

Fig. 3 Parallel computing tasks for Algorithm 1 on GPU

\(L_j\), \(D_j\), and \(U_j\) of each block can be loaded into shared memory on the GPU to reduce access latency, because these data sets are reused in all iterations. The approximate solutions \(x_j\) generated by the iterations are used by multiple SpMV operations and should also be stored in shared memory. The residual \(\beta \) of the iterative solution is calculated in two steps, within a block and between blocks. A parallel reduction can be used to obtain the residual \(\beta \); the parallel reduction of the residual calculation using CUDA is shown in Algorithm 2.

Algorithm 2 Parallel reduction of the residual calculation using CUDA

In Algorithm 2, the maximum residual of a block is calculated in lines 5–12 and stored in the first element of the residual sub-array corresponding to that block. The maximum residual of the whole residual array is then obtained from the per-block maxima and stored in the first element of the residual array.
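Since Algorithm 2 is likewise reproduced only as a figure, the following is a standard shared-memory maximum-reduction kernel consistent with the description above (BLOCK_SIZE and the kernel name are illustrative); a second launch over the per-block maxima, or a short host loop, completes the second step.

```cuda
// Within-block step of the residual reduction: each thread block reduces its
// segment of the residual array to one maximum, stored at the first element
// of that segment. Launch with BLOCK_SIZE threads per block (a power of two).
#define BLOCK_SIZE 256

__global__ void maxResidualKernel(double *res, int n) {
    __shared__ double sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? fabs(res[i]) : 0.0;       // load |residual| into shared memory
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction inside the block
        if (tid < s && sdata[tid + s] > sdata[tid])
            sdata[tid] = sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)                                    // block maximum (lines 5-12 analogue)
        res[blockIdx.x * blockDim.x] = sdata[0];     // first element of the residual sub-array
}
```

A launch such as maxResidualKernel<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(res, n) performs the within-block step; the between-block step then reduces the per-block maxima to the first element of the residual array.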

5 A case study: cloud resources scheduling

More and more computing resources are deployed in cloud-computing platforms, and massive numbers of computing tasks are processed on them. The computing resources should be allocated reasonably, and the computing tasks should be processed in time. Some computing resources will be idle if too many computing resources are turned on. Therefore, some resources should be turned off when the number of computing tasks declines; conversely, more computing resources should be turned on when the number of computing tasks increases. The number of computing resources that are turned on should be consistent with the number of arriving tasks. However, some computing resources may not be fully used, because the arrival times of tasks are random.

5.1 M/M/\(n+k\) dynamic queuing model

Assume that the arrival of tasks follows a Poisson distribution and the processing time of the tasks follows an exponential distribution. The M/M/\(n+k\) dynamic queuing model can describe the process of cloud resource scheduling if the computing resources are identical. A new computing resource is turned on if the number of tasks processed on a computing resource reaches an upper limit; conversely, a used computing resource is turned off if the average number of tasks processed on the computing resources falls below a lower limit. Tasks migrated to new computing resources are delayed, because turning computing resources on and off is time-consuming. The state transition diagram of task allocation and resource scheduling using the M/M/\(n+k\) dynamic queuing model is a birth-and-death process, and the instantaneous state flows are shown in Fig. 4, where the ellipses represent the states of the resources and the numbers in the ellipses represent the numbers of pending tasks assigned to the resources. The first line of Fig. 4 represents the states of the cloud-computing system with one resource, the second line represents the system with two resources, and so on. \(\lambda \), \(\mu \), \(\eta \), and \(\nu \) denote the request arrival rate, the processor's service rate, the processor's opening rate, and the processor's closing rate, respectively. The number i in an ellipse indicates that there are i pending tasks in the cloud-computing system.

Fig. 4 State transition diagram of task allocation and resource scheduling using M/M/\(n+k\) dynamic queuing model

Assume that there are n computing resources in a cloud-computing platform. When more than k computing tasks assigned to a computing resource are pending, a new computing resource is turned on to process the new tasks. Furthermore, computing resources without computing tasks should be turned off.

For a cloud-computing system, how many computing resources need to be deployed? The utilization efficiency of the resources should be computed, and the probability that the resources are used must be calculated for the Poisson arrival of tasks. Define \(S_{i,j}\) as the state with i pending tasks on j computing resources. The transition probabilities between the system states can be calculated by

$$\begin{aligned} \left\{ \begin{array}{l} \lambda _{i,j}= \left\{ \begin{array}{ll} \lambda ,&{}\quad i \le j \times k\\ 0, &{}\quad i>j \times k\\ \end{array} \right. \\ \mu _{i,j}=\left\{ \begin{array}{ll} \lfloor \frac{i}{k} \rfloor \times \mu ,&{}\quad j \le i \le j \times k\\ k \times \mu , &{}\quad i>j \times k\\ \mu , &{}\quad i< j\\ \end{array} \right. \\ \eta _{i,j}= \left\{ \begin{array}{ll} 0,&{}\quad i \le j \times k\\ \lceil \frac{i-j \times k}{k} \rceil \times \eta , &{}\quad i>j \times k\\ \end{array} \right. \\ \nu _{i,j}= \left\{ \begin{array}{ll} 0,&{}\quad i \ge j \times k\\ \lfloor \frac{j \times k-i}{k} \rfloor \times \nu , &{}\quad i<j \times k\\ \end{array} \right. \\ \end{array} \right. \end{aligned}$$
(8)
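For illustration, Eq. (8) transcribes directly into host-side helper code (a sketch; the function and type names are ours):

```cuda
// Direct transcription of Eq. (8): i is the number of pending tasks, j the
// number of switched-on resources, k the per-resource task limit.
#include <cmath>

struct Rates { double lambda, mu, eta, nu; };

Rates transitionRates(int i, int j, int k,
                      double lambda, double mu, double eta, double nu) {
    Rates r;
    r.lambda = (i <= j * k) ? lambda : 0.0;

    if (i < j)            r.mu = mu;                             // fewer tasks than resources
    else if (i <= j * k)  r.mu = std::floor((double)i / k) * mu; // j <= i <= j*k
    else                  r.mu = k * mu;                         // i > j*k

    r.eta = (i <= j * k) ? 0.0
                         : std::ceil((double)(i - j * k) / k) * eta;

    r.nu  = (i >= j * k) ? 0.0
                         : std::floor((double)(j * k - i) / k) * nu;
    return r;
}
```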

If there are too many pending tasks on a computing node, another computing node should be turned on to process the new tasks. For j computing resources, the probability of the state \(S_{i,j}\) is very small when \(i>k \times (j+1)\), and it is also very small when \(i<k \times (j-1)\), because computing resources without computing tasks are turned off; these states can therefore be ignored. The transition probability matrix P is a block-tridiagonal matrix composed according to Eq. (8), whose size is \(n \times n\) and whose block size is k. The \(k \times k\) submatrices along the main diagonal are tridiagonal matrices, and the submatrices off the main diagonal are sparse matrices. The computing resources should not be turned on and off frequently, because doing so takes time. Therefore, a certain number of computing resources should be turned on in advance; but how many? Define \(\pi _i\) as the probability of the state with i pending tasks in the cloud system. The probabilities \(\pi _i\) for \(i=1,2,\ldots ,n\) are obtained by solving the following Markov stationary equation:

$$\begin{aligned} \pi =\pi P. \end{aligned}$$
(9)

Assume that the state with maximum probability is \(\pi _M\), which has M pending tasks; then \(M/k\) computing resources should be turned on in advance to reduce the waiting time of pending tasks.
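For completeness, Eq. (9) can be reduced to an ordinary linear system by a standard reformulation (not spelled out in the original):

$$\begin{aligned} \pi =\pi P \iff (P^{T}-I)\,\pi ^{T}=0,\qquad \sum _{i=1}^{n}\pi _i=1, \end{aligned}$$

where \(P^{T}-I\) inherits the block-tridiagonal structure of P; in practice, one equation of the homogeneous system is replaced by the normalization row, giving a block-tridiagonal linear system of the kind solved in Sect. 5.3.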

5.2 Hybrid storage format for block-tridiagonal matrix

In the block-tridiagonal matrix obtained from the M/M/\(n+k\) dynamic queuing model for cloud resource scheduling, the diagonal blocks \(D_i\) are tridiagonal matrices, and the off-diagonal blocks \(U_i\) and \(L_i\) are sparse matrices.

The 9-by-9 block-tridiagonal matrix B shown below is an example obtained from the M/M/\(n+k\) dynamic queuing model:

Matrix B contains seven non-zero 3-by-3 blocks. \(D_1\), \(D_2\), and \(D_3\) are tridiagonal matrices, which can be stored in the DIA format. Furthermore, the \(\mathbf {offset}\) array can be omitted, because the offsets of the three diagonals are fixed (\(-1,0,1\)). \(L_2\), \(L_3\), \(U_1\), and \(U_2\) are sparse matrices with few non-zero elements in each row. Therefore, \(L_2\), \(L_3\), \(U_1\), and \(U_2\) should be stored in the COO format. The number of storage units for \(D_1\), \(D_2\), and \(D_3\) is 21 (three tridiagonal blocks with \(3k-2=7\) stored entries each), and that for \(L_2\), \(L_3\), \(U_1\), and \(U_2\) is 36 (12 COO triplets of row, column, and value). Few storage units are occupied using the COO format for \(L_2\), \(L_3\), \(U_1\), and \(U_2\), because these blocks have few non-zero elements and their distribution is irregular.

5.3 Experimental evaluation

5.3.1 Experiment settings

The following test environment was used for all benchmarks. The test computer is equipped with two AMD Opteron 6376 CPUs running at 2.30 GHz and an NVIDIA Tesla K20c GPU. Each CPU has 16 cores. The GPU has 2496 CUDA cores running at a 0.705 GHz clock and 4 GB of global memory with a 320-bit-wide interface at a 2.6 GHz clock, and its CUDA compute capability is 3.5. As for the software, the test machine runs 64-bit Windows 7 and the NVIDIA CUDA toolkit 7.0. The hardware parameters of the test computer are shown in Table 1.

Table 1 Parameters of the test computer

All benchmarks are chosen from the simulation of cloud-computing systems. Ten kinds of cloud-computing systems were simulated to obtain ten block-tridiagonal matrices. Further characteristics of these matrices are given in Table 2, where n, k, and \({ NNZ}\) are the order of the matrix, the block size, and the number of non-zero elements, respectively.

Table 2 General information of the block-tridiagonal matrices used in the evaluation

5.3.2 The test using HBISA on GPU

The tridiagonal submatrices along the main diagonal of the coefficient matrix are solved by our parallel CR algorithm on the GPU. The SuiteSparse library provides a sparse direct solver, SuiteSparseQR, for the GPU [62, 63]. The CUDA tools also provide the cuSolver library, a high-level package based on the cuBLAS and cuSPARSE libraries [54]. The cuSolverSP component of the cuSolver library provides a set of sparse routines based on sparse QR factorization. However, the performance of SuiteSparseQR is worse than that of cuSolverSP. For QR factorization, the number of non-zero elements increases sharply because of the irregular distribution of non-zero elements, resulting in a decline in performance. The cases were solved using the iterative solvers of the PARALUTION library in parallel and serial modes on the CPUs and the GPU [56]; the solving process does not converge using the GMRES method of PARALUTION for the test cases. Therefore, the CG and BiCGSTAB iterative solvers of the PARALUTION library are used to test the cases.

The convergence speed refers to the number of iterations needed to solve the equation using an iterative algorithm: the fewer the iterations, the faster the convergence. The convergence tolerance is set to \(10^{-7}\) for the iterative algorithms in the test, and the cases are solved by the tested algorithms in double precision. The number of iterations for solving the cases using the iterative algorithms HBISA, CG, and BiCGSTAB is shown in Table 3, where \({ No}\) indicates that the solving process does not converge using that iterative algorithm for the test case. It is observed from Table 3 that Cloud32–1000 cannot be solved by CG. HBISA and BiCGSTAB have better robustness than CG, and the number of iterations using CG is far greater than that of HBISA and BiCGSTAB for Cloud16–500, Cloud16–800, Cloud16–1000, and Cloud16–2000. The convergence of HBISA is faster than that of CG and BiCGSTAB except for Cloud64–500. For the test cases, the average number of iterations is reduced by 283.15 and 18.34 % using HBISA compared with CG and BiCGSTAB of the PARALUTION library, respectively.

Table 3 Number of iterations for solving the cases using iterative algorithms

The performance of solving the test cases on the K20c GPU is shown in Table 4, where \({ No}\) indicates that the case cannot be solved using the algorithm. For Cloud16–500, Cloud16–800, Cloud16–1000, and Cloud16–2000, the performance of CG is significantly worse than that of HBISA and BiCGSTAB, because the number of iterations using CG is far greater than that of HBISA and BiCGSTAB for these cases. For Cloud64–500, the performance of HBISA is worse than that of CG, BiCGSTAB, and cuSolverSP, because the convergence of HBISA is slow for this case.

Table 4 Performance of solving the cases (unit: second)

The parallel efficiency refers to the speedup, a metric that expresses relative performance improvement. To describe the performance improvement more clearly, the performance improvement percentages of HBISA on the GPU over CG, BiCGSTAB of the PARALUTION library, and cuSolverSP of CUDA shown in Fig. 5 are calculated as \((T_i-T_H)/T_H \times 100\) (\(i=1,2,3\)), where \(T_1\), \(T_2\), and \(T_3\) are the solving times of CG, BiCGSTAB of the PARALUTION library, and cuSolverSP, respectively, and \(T_H\) is the solving time of HBISA. It is observed from Fig. 5 that the average performance of solving the test cases on the GPU is improved by 26.98, 11.52, and 9.25 % using HBISA compared with CG, BiCGSTAB of the PARALUTION library, and cuSolverSP of CUDA, respectively.

Fig. 5 Performance improvement of solving the cases using HBISA on K20c GPU

6 Conclusion

In this paper, a parallel hybrid solving algorithm is proposed for block-tridiagonal systems of linear equations; its performance is better than that of other iterative and direct algorithms on the GPU because of its faster convergence and higher parallel efficiency. Furthermore, we analyse a cloud resource scheduling model and obtain ten block-tridiagonal matrices produced by the simulation of cloud-computing systems. The performance of solving these block-tridiagonal systems of linear equations is improved by using our method. Other sparse linear systems arising from numerical simulation may be quasi-block-diagonal equations, and how to solve them quickly will be the next step of our investigation.