Parallel Computing

Volume 25, Issues 13–14, December 1999, Pages 1995–2014

Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations

https://doi.org/10.1016/S0167-8191(99)00064-2

Abstract

This paper is concerned with the parallel implementation of incomplete factorization preconditioned iterative methods. Although a parallel ordering such as multicolor ordering may increase parallelism in the factorization, it often slows convergence of the preconditioned method, and thus may offset the gain in speed obtained with parallelization. Moreover, the higher the parallelism of an ordering, the slower the convergence; the lower the parallelism, the faster the convergence. This well-known trade-off between parallelism and convergence is well explained by the property of compatibility, the level of which can be clearly seen when an ordering is presented in graph form (S. Doi, A. Lichnewsky, A graph-theory approach for analyzing the effects of ordering on ILU preconditioning, INRIA report 1452, 1991). For a given method, the fewer the incompatible local graphs in an ordering (i.e., the lower the parallelism), the faster the convergence (S. Doi, Appl. Numer. Math. 7 (1991) 417–436; S. Doi, in: T. Nodera (Ed.), Advances in Numerical Methods for Large Sparse Sets of Linear Systems, No. 7, Keio University, 1991). An ordering with no incompatible local graphs, such as one implemented on vector multiprocessors by means of the nested dissection technique, has excellent convergence, but its parallelism is limited (S. Doi, A. Lichnewsky, Int. J. High Speed Comput. 2 (1990) 143–179). To attain a better balance, a certain degree of incompatibility is necessary. In this regard, increasing the number of colors in multicolor ordering can be a useful approach (S. Fujino, S. Doi, in: R. Beauwens (Ed.), Proceedings of the IMACS International Symposium on Iterative Methods in Linear Algebra, March 1991; S. Doi, A. Hoshi, Int. J. Comput. Math. 44 (1992) 143–152). Two related techniques also presented here are overlapped multicolor ordering (T. Washio, K. Hayami, SIAM J. Sci. Comput. 16 (1995) 631–650) and a fill-in strategy applied selectively to incompatible local graphs. Experiments conducted on an SX-5/16A vector parallel supercomputer show the relative effectiveness of increasing the number of colors, and also of using this approach in combination with overlapping and with fill-ins.

Introduction

Among the most effective methods for solving large sparse systems of linear equations are the preconditioned iterative methods, in which a basic iterative method is applied to a preconditioned system M⁻¹Au=M⁻¹b (where M is the preconditioner) instead of to the original system Au=b. The objective of the preconditioning is to reduce the condition number (or to cluster the eigenvalues) of the original system so as to reach an approximate solution in fewer iterations. This requires solving the system Mv=g at each iteration. Hence, a good preconditioner must satisfy two requirements: it should approximate A well enough to produce a reduced condition number (or better clustered eigenvalues), and the system Mv=g should be much easier to solve than the original system. Further, the increasing parallelism of recent computer architectures has led to a new requirement: the solution of Mv=g must also have enough parallelism to be mapped naturally onto the computer to be used.
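
To make the role of the preconditioner solve Mv=g concrete, the following is a minimal sketch (not the method studied in this paper): a basic preconditioned iteration in which a simple Jacobi (diagonal) preconditioner stands in for an incomplete factorization M=LU; the function names and the test problem are illustrative only.

```python
import numpy as np

def preconditioned_richardson(A, b, solve_M, tol=1e-8, max_iter=5000):
    """u_{k+1} = u_k + M^{-1}(b - A u_k); solve_M applies M^{-1} to a vector."""
    u = np.zeros_like(b)
    for k in range(max_iter):
        r = b - A @ u                      # residual of the original system Au = b
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return u, k
        u = u + solve_M(r)                 # the preconditioner solve M v = r
    return u, max_iter

# tiny test: 1D Laplacian with a Jacobi (diagonal) preconditioner as stand-in for M = LU
n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
d = np.diag(A)
u, iters = preconditioned_richardson(A, b, lambda g: g / d)
print(iters, np.linalg.norm(b - A @ u))
```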

One commonly used type of preconditioning is the incomplete factorization M=LU, where L and U are lower and upper triangular matrices with a sparsity structure similar to that of A, produced by neglecting certain fill-ins in the Gaussian elimination [19]. One way to attain parallelism in the solution of Mv=LUv=g (in practice, this solution is obtained by first solving Lw=g (forward substitution) and then solving Uv=w (backward substitution)) is to reorder the system Au=b and to construct an incomplete factorization of the reordered system. A number of techniques developed in the past do exactly this [1], [3], [5], [7], [10], [12], [13], [14], [17], [18], [21]. One possibility, for example, is to use the well-known red–black ordering. Solving Lw=g under a red–black ordering can be performed simultaneously for half of the unknowns in w; the same holds for solving Uv=w (a minimal sketch of this two-sweep substitution follows the list of remarks below). In this case, however, a fundamental difficulty arises from the trade-off between parallelism and convergence: when the incomplete factorization preconditioned iterative method is applied, higher parallelism in the ordering results in slower convergence, and faster convergence requires lower parallelism. This trade-off was first reported by Duff and Meurant [10] on the basis of intensive numerical tests with many orderings. Since then, many attempts have been made to answer the following two fundamental questions: why is there a trade-off between parallelism and convergence in incomplete factorizations, and how can the trade-off be overcome [2], [3], [6], [10], [11], [16], [20]? Before we give our own answers to these questions, it is worth repeating the remarks of Duff and Meurant [10], which are still valuable:

  • (DM1) The rate of convergence is almost directly related to the norm of the remainder matrix R=M−A, but not to the number of fill-ins dropped.

  • (DM2) It appears that the harder the problem (discontinuous coefficients, anisotropy, etc.) is, the more important the ordering for the incomplete factorization is.

  • (DM3) A single level of fill-in changes the relative performance of the different ordering schemes; for example, mind, rb, and altd (parallel orderings) do somewhat better when some fill-in is allowed in L. One reason for this is that many of the first-level fill-ins for these orderings are quite large, unlike row, for example, where the fill-ins rapidly decrease in value.
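
Returning to the red–black example above, the following is a minimal sketch of the two-sweep substitution (an illustration under the assumption that L is the unit lower triangular factor of a no-fill factorization of the red–black ordered 5-point matrix; the function names are ours, not the paper's).

```python
import numpy as np

def redblack_permutation(n):
    """Red-black ordering of an n-by-n grid: all red nodes first, then all black."""
    idx = np.arange(n * n).reshape(n, n)
    colour = np.add.outer(np.arange(n), np.arange(n)) % 2   # (i + j) mod 2
    return np.concatenate([idx[colour == 0], idx[colour == 1]])

def forward_substitution_two_sweeps(L, g, n_red):
    """Solve L w = g for a unit lower triangular L in red-black ordering.

    With the 5-point stencil, no two nodes of the same colour are coupled, so the
    red rows of L contain only the unit diagonal and the black rows couple only to
    red unknowns: the solve reduces to two fully parallel (vectorisable) sweeps."""
    w = np.empty_like(g)
    w[:n_red] = g[:n_red]                                    # sweep 1: all red unknowns at once
    w[n_red:] = g[n_red:] - L[n_red:, :n_red] @ w[:n_red]    # sweep 2: all black unknowns at once
    return w
```

The backward substitution with U has the same two-sweep structure, with the colours visited in reverse order.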

Kuo and Chan [16] proved that the condition number of a preconditioned system based on a red–black ordering is only about 1/4 of that of the unpreconditioned system, i.e., there is no asymptotic improvement no matter how small the grid size h becomes. Eijkhout [11] used an analysis of the infinity norm of the remainder matrix R=M−A to derive a criterion that places orderings for no-fill factorization into one of two categories, according to whether or not the ordering contains nodes that are eliminated before both of their neighbors in one direction.
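
Eijkhout's criterion can be checked directly from the elimination numbers of the grid nodes. The sketch below (our own illustration, with the assumed convention that missing boundary neighbors count as already eliminated) counts such nodes and shows that the row-wise ordering contains none while the red–black ordering contains many.

```python
import numpy as np

def count_incompatible_nodes(order):
    """Count nodes eliminated before both of their neighbours in one grid direction.
    `order` holds elimination numbers on an n-by-n grid."""
    n = order.shape[0]
    count = 0
    for i in range(n):
        for j in range(n):
            o = order[i, j]
            west = order[i, j - 1] if j > 0 else -1          # boundary: already eliminated
            east = order[i, j + 1] if j < n - 1 else -1
            south = order[i - 1, j] if i > 0 else -1
            north = order[i + 1, j] if i < n - 1 else -1
            if (o < west and o < east) or (o < south and o < north):
                count += 1
    return count

n = 8
natural = np.arange(n * n).reshape(n, n)                     # row-wise (natural) ordering
colour = np.add.outer(np.arange(n), np.arange(n)) % 2        # red-black colouring
rb = np.argsort(np.argsort(colour.ravel() * n * n + natural.ravel())).reshape(n, n)
print(count_incompatible_nodes(natural), count_incompatible_nodes(rb))   # 0 vs. many
```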

Doi and Lichnewsky [6] attempted to answer the first question, “why is there a trade-off?”, by utilizing graph representations of orderings. They showed that when an ordering graph contains an incompatible (local) graph, the components of the remainder matrix R corresponding to the incompatible graph are significantly larger than the components corresponding to any compatible graph. Furthermore, while components corresponding to a compatible graph converge to zero as the model equation parameters (e.g., grid aspect ratio, anisotropy) increase or decrease, components corresponding to an incompatible graph do not. Another interesting observation is that the components of r_k=b−Au_k (the residual vector after k iterations of a preconditioned iterative method) corresponding to incompatible local graphs have significantly larger values than those corresponding to compatible local graphs. An ordering which has no incompatible graphs is called compatible. Doi and Lichnewsky also confirmed that the trade-off observed by Duff and Meurant can be explained very well by means of the property of compatibility. With respect to parallel computing, an incompatible local graph corresponds to a potential starting point of the forward substitution with L. This may explain the trade-off; that is, an ordering with higher parallelism has more incompatible local graphs, which break the compatibility of the ordering and thus degrade the rate of convergence. It is also interesting that compatibility is equivalent to Eijkhout's criterion, although his study was performed independently. Doi and Lichnewsky [5] implemented several incomplete factorizations based on compatible orderings on a Cray-2 shared-memory vector multiprocessor using the nested dissection technique, and showed that these orderings can be mapped effectively onto vector multiprocessors. Doi [3], [4] made a more quantitative study of this trade-off problem using two parameterized orderings, parallel block PB(m) and parallel diagonal PD(m) (where m is a parameter that controls the degree of parallelism), and showed that the rate of convergence degrades rather smoothly as the degree of parallelism increases. A pragmatic solution designed to strike a better balance between parallelism and convergence is the use of 50 or more colors in multicolor ordering, rather than the conventionally used four or eight colors [4], [7], [13]. Such a large-numbered multicolor ordering technique attains high performance on an SX-3/14 vector supercomputer with only minor degradation in convergence [7].
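
One simple way to build a multicolor ordering with a tunable number of colors is a diagonal coloring, sketched below (this illustrates the idea only and is not necessarily the coloring used in the paper; the function name and color counts are ours).

```python
import numpy as np

def multicolour_ordering(n, nc):
    """Elimination numbers for an n-by-n grid under a diagonal multicolouring
    with nc colours: colour(i, j) = (i + j) mod nc.  Neighbours in the 5-point
    stencil differ by exactly one in i + j, so they never share a colour, and
    all nodes of one colour can be eliminated (and solved for) in parallel."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    colour = (i + j) % nc
    key = colour.ravel() * (n * n) + np.arange(n * n)   # colour-major numbering
    return np.argsort(np.argsort(key)).reshape(n, n)

order_2 = multicolour_ordering(199, 2)     # red-black: maximal parallelism
order_75 = multicolour_ordering(199, 75)   # large-numbered multicolouring (illustrative count)
```

With nc=2 this reduces to red–black ordering; as nc approaches 2n−1, each color shrinks to a single anti-diagonal and the ordering approaches a compatible hyperplane ordering, so the number of colors acts as the knob trading parallel vector length against the number of incompatible local graphs.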

Further techniques have also been proposed to compensate for the significantly larger components of R corresponding to the incompatible graphs that occasionally appear in large-numbered multicolor incomplete factorizations. Washio and Hayami [24], for example, introduced the overlapping technique, whose key idea is to compensate for those components by repeating parts of the forward and backward substitutions. It is also possible to apply the fill-in technique selectively to incompatible graphs. Since such graphs appear only rarely in a large-numbered multicolor ordering, this fill-in does not significantly reduce parallelism, and its cost is sufficiently low.
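
Where such selective fill-ins would be placed can be sketched as follows (our own illustration, not the paper's implementation): eliminating a node before both of its neighbors in one direction creates a level-1 fill-in between those two neighbors, so it suffices to admit fill-in only at the positions collected below; when incompatible nodes are rare, this list stays short.

```python
import numpy as np

def selective_fillin_pairs(order):
    """For each node eliminated before both of its neighbours in one grid direction
    (an incompatible local graph), return the pair of neighbours between which a
    level-1 fill-in would arise.  `order` holds elimination numbers on an n-by-n grid."""
    n = order.shape[0]
    pairs = []
    for i in range(n):
        for j in range(n):
            o = order[i, j]
            for a, b in (((i, j - 1), (i, j + 1)), ((i - 1, j), (i + 1, j))):
                inside = all(0 <= p < n and 0 <= q < n for p, q in (a, b))
                if inside and o < order[a] and o < order[b]:
                    pairs.append((a, b))    # allow fill-in between these two nodes only
    return pairs
```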

Section 2 of this paper introduces some fundamental definitions, including equivalence and compatibility of orderings. Section 3 presents an analysis that answers the question “why is there a trade-off?” Section 4 introduces ordering strategies and related techniques; these are, in a sense, answers to the question “how can we overcome the trade-off?” Section 5 presents the results of numerical experiments conducted on an SX-5 vector parallel supercomputer. Our results show that a large-numbered multicolor MILU–BiCGSTAB [23] method, combined with either overlapping or selective fill-ins, can solve some discrete convection–diffusion equations, discretized on a 199³ grid, in a time on the order of 1 s, with a sustained speed of about 40 Gflops on an SX-5/16A (16 CPUs, peak speed of 128 Gflops).

Section snippets

Preliminaries

This section gives definitions and theorems necessary to support the discussion in the sections which follow it. Readers are also advised to refer to [8], [9], which describe some notions and technical terms appearing here without definition.

Model problem

The discussion here is based on a model problem: a 5-point (or 7-point) finite difference discretization of a 2D (or 3D) convection–diffusion equation of the form −∇·(K∇u)+V·∇u=f, defined on a rectangular (or cubic) domain Ω. The elements of K (= diag[kx, ky, kz]) and V (= [vx, vy, vz]) are assumed to be constant in Ω. Dirichlet boundary conditions are imposed on the boundary ∂Ω.
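
For concreteness, a minimal sketch of assembling the 2D discretization is given below (the unit square, the uniform grid spacing, and central differencing of the convection term are our assumptions; the actual model parameters used in the experiments are listed in Appendix A).

```python
import scipy.sparse as sp

def convection_diffusion_matrix(n, kx=1.0, ky=1.0, vx=0.0, vy=0.0):
    """5-point finite difference matrix for -div(K grad u) + V . grad u on the
    unit square with n-by-n interior grid points and Dirichlet boundaries."""
    h = 1.0 / (n + 1)
    D2 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2    # -d2/dx2
    D1 = sp.diags([-1.0, 0.0, 1.0], [-1, 0, 1], shape=(n, n)) / (2 * h)  # d/dx, central
    I = sp.identity(n)
    A = (kx * sp.kron(I, D2) + ky * sp.kron(D2, I)      # diffusion terms
         + vx * sp.kron(I, D1) + vy * sp.kron(D1, I))   # convection terms
    return A.tocsr()

A = convection_diffusion_matrix(199)    # 199^2 unknowns for the 2D case
```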

Relation between ordering graph structure and the remainder matrix R

Since the rate of convergence of the ILU preconditioned iterative method is directly related to the norm of the remainder matrix R=M−A, it is

Design (implementation on vector multiprocessors)

The analysis presented in the previous section serves as the basis for a parallel ordering design that strikes a reasonable balance between parallelism and convergence, one which can be implemented efficiently on actual parallel computers.

Numerical experiments

This section reports the results of numerical experiments conducted on an SX-5/16A vector parallel supercomputer with 16 high-performance vector processors (128 Gflops peak speed). The experiments were applied to model problem (1) (the model parameters are given in Appendix A). A 199×199×199 grid was used, which produced nearly eight million unknowns.
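
As a small, self-contained stand-in for such a run (scipy's BiCGSTAB with its generic spilu incomplete factorization in place of the paper's parallel multicolor MILU preconditioner; the 2D grid, coefficients, and tolerances below are illustrative only):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# small 2D convection-diffusion test system (assumed coefficients)
n, h = 199, 1.0 / 200
D2 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
D1 = sp.diags([-1.0, 0.0, 1.0], [-1, 0, 1], shape=(n, n)) / (2 * h)
I = sp.identity(n)
A = (sp.kron(I, D2) + sp.kron(D2, I)
     + 10.0 * sp.kron(I, D1) + 10.0 * sp.kron(D1, I)).tocsc()
b = np.ones(A.shape[0])

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)        # generic ILU, a stand-in for MILU
M = spla.LinearOperator(A.shape, matvec=ilu.solve)        # applies the preconditioner solve
x, info = spla.bicgstab(A, b, M=M)
print(info, np.linalg.norm(b - A @ x))                    # info == 0 signals convergence
```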

We used the BiCGSTAB method [23] as our basic iterative method, in which four types of preconditioning were applied: (1) hyperplane MILU, (2)

Concluding remarks

In this paper, we have discussed ordering strategies and related techniques for overcoming the trade-off between parallelism and convergence that is observed in incomplete factorizations. Graph representation of orderings is important because it gives a one-to-one correspondence between each graph and each set of equivalent orderings that yield the same convergence with preconditioned iterative methods, and with graph representations, the property of compatibility can be seen to

Acknowledgements

The authors would like to thank Mr. Akira Asami of NEC Informatec Systems for tuning our programs and for testing them on the SX-5 supercomputer.

References (25)

  • J.J. Dongarra et al., Solving linear systems on vector and shared memory computers (1990)

  • J.J. Dongarra et al., Numerical linear algebra for high-performance computers (1998)