Parallel Computing

Volume 25, Issues 13–14, December 1999, Pages 1995–2014

Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations

https://doi.org/10.1016/S0167-8191(99)00064-2

Abstract

This paper is concerned with the parallel implementation of incomplete factorization preconditioned iterative methods. Although a parallel ordering such as multicolor ordering may increase parallelism in the factorization, it often slows convergence of the preconditioned method, and thus may offset the gain in speed obtained with parallelization. Moreover, the higher the parallelism of an ordering, the slower the convergence; the lower the parallelism, the faster the convergence. This well-known trade-off between parallelism and convergence is well explained by the property of compatibility, the level of which can be clearly seen when an ordering is presented in graph form (S. Doi, A. Lichnewsky, A graph-theory approach for analyzing the effects of ordering on ILU preconditioning, INRIA report 1452, 1991). For a given method, the fewer the incompatible local graphs in an ordering (i.e., the lower the parallelism), the faster the convergence (S. Doi, Appl. Numer. Math. 7 (1991) 417–436; S. Doi, in: T. Nodera (Ed.), Advances in Numerical Methods for Large Sparse Sets of Linear Systems, No. 7, Keio University, 1991). An ordering with no incompatible local graphs, such as one implemented on vector multiprocessors by means of the nested dissection technique, has excellent convergence, but its parallelism is limited (S. Doi, A. Lichnewsky, Int. J. High Speed Comput. 2 (1990) 143–179). To attain a better balance, a certain degree of incompatibility is necessary. In this regard, increasing the number of colors in multicolor ordering can be a useful approach (S. Fujino, S. Doi, in: R. Beauwens (Ed.), Proceedings of the IMACS International Symposium on Iterative Methods in Linear Algebra, March 1991; S. Doi, A. Hoshi, Int. J. Comput. Math. 44 (1992) 143–152). Two related techniques also presented here are overlapped multicolor ordering (T. Washio, K. Hayami, SIAM J. Sci. Comput. 16 (1995) 631–650) and a fill-in strategy applied selectively to incompatible local graphs. Experiments conducted on an SX-5/16A vector parallel supercomputer show the relative effectiveness of increasing the number of colors, and also of using this approach in combination with overlapping and with fill-ins.

Introduction

Among the most effective methods for solving large sparse systems of linear equations are the preconditioned iterative methods, in which a basic iterative method is applied to a preconditioned system M⁻¹Au=M⁻¹b (where M is the preconditioner) instead of to the original system Au=b. The objective of the preconditioning is to reduce the condition number (or to cluster the eigenvalues) of the original system so as to reach an approximate solution in fewer iterations. This requires solving the system Mv=g at each iteration. Hence, a good preconditioner must satisfy two requirements: it should approximate A well enough to produce a reduced condition number (or better clustered eigenvalues), and the system Mv=g should be much easier to solve than the original system. Further, the increasing parallelism of recent computer architectures has led to a new requirement: the solution of Mv=g must also have enough parallelism to be mapped naturally onto the computer to be used.
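
To make the role of the preconditioner solve Mv=g concrete, the following is a minimal sketch (not the method studied in this paper): a basic preconditioned iteration in which a simple Jacobi (diagonal) preconditioner stands in for an incomplete factorization M=LU; the function names and the test problem are illustrative only.

```python
import numpy as np

def preconditioned_richardson(A, b, solve_M, tol=1e-8, max_iter=5000):
    """u_{k+1} = u_k + M^{-1}(b - A u_k); solve_M applies M^{-1} to a vector."""
    u = np.zeros_like(b)
    for k in range(max_iter):
        r = b - A @ u                      # residual of the original system Au = b
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return u, k
        u = u + solve_M(r)                 # the preconditioner solve M v = r
    return u, max_iter

# tiny test: 1D Laplacian with a Jacobi (diagonal) preconditioner as stand-in for M = LU
n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
d = np.diag(A)
u, iters = preconditioned_richardson(A, b, lambda g: g / d)
print(iters, np.linalg.norm(b - A @ u))
```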

One commonly used type of preconditioning is the incomplete factorization M=LU, where L and U are lower and upper triangular matrices with a sparsity structure similar to that of A, produced by neglecting certain fill-ins in the Gaussian elimination [19]. One way to attain parallelism in the solution of Mv=LUv=g (in practice, this solution is obtained by first solving Lw=g (forward substitution) and then solving Uv=w (backward substitution)) is to reorder the system Au=b and to construct an incomplete factorization of the reordered system. A number of techniques developed in the past do exactly this [1], [3], [5], [7], [10], [12], [13], [14], [17], [18], [21]. One possibility, for example, is to use the well-known red–black ordering. Solving Lw=g under a red–black ordering can be performed simultaneously for half of the unknowns in w; the same holds for solving Uv=w (a minimal sketch of this two-sweep substitution follows the list of remarks below). In this case, however, a fundamental difficulty arises from the trade-off between parallelism and convergence: when the incomplete factorization preconditioned iterative method is applied, higher parallelism in the ordering results in slower convergence, and faster convergence requires lower parallelism. This trade-off was first reported by Duff and Meurant [10] on the basis of intensive numerical tests with many orderings. Since then, many attempts have been made to answer the following two fundamental questions: why is there a trade-off between parallelism and convergence in incomplete factorizations, and how can the trade-off be overcome [2], [3], [6], [10], [11], [16], [20]? Before we give our own answers to these questions, it is worth repeating the remarks of Duff and Meurant [10], which are still valuable:

  • (DM1) The rate of convergence is almost directly related to the norm of the remainder matrix R=M−A, but not to the number of fill-ins dropped.

  • (DM2) It appears that the harder the problem (discontinuous coefficients, anisotropy, etc.) is, the more important the ordering for the incomplete factorization is.

  • (DM3) A single level of fill-in changes the relative performance of the different ordering schemes; for example, mind, rb, and altd (parallel orderings) do somewhat better when some fill-in is allowed in L. One reason for this is that many of the first-level fill-ins for these orderings are quite large, unlike row, for example, where the fill-ins rapidly decrease in value.
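
Returning to the red–black example above, the following is a minimal sketch of the two-sweep substitution (an illustration under the assumption that L is the unit lower triangular factor of a no-fill factorization of the red–black ordered 5-point matrix; the function names are ours, not the paper's).

```python
import numpy as np

def redblack_permutation(n):
    """Red-black ordering of an n-by-n grid: all red nodes first, then all black."""
    idx = np.arange(n * n).reshape(n, n)
    colour = np.add.outer(np.arange(n), np.arange(n)) % 2   # (i + j) mod 2
    return np.concatenate([idx[colour == 0], idx[colour == 1]])

def forward_substitution_two_sweeps(L, g, n_red):
    """Solve L w = g for a unit lower triangular L in red-black ordering.

    With the 5-point stencil, no two nodes of the same colour are coupled, so the
    red rows of L contain only the unit diagonal and the black rows couple only to
    red unknowns: the solve reduces to two fully parallel (vectorisable) sweeps."""
    w = np.empty_like(g)
    w[:n_red] = g[:n_red]                                    # sweep 1: all red unknowns at once
    w[n_red:] = g[n_red:] - L[n_red:, :n_red] @ w[:n_red]    # sweep 2: all black unknowns at once
    return w
```

The backward substitution with U has the same two-sweep structure, with the colours visited in reverse order.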

Kuo and Chan [16] proved that the condition number of a preconditioned system based on a red–black ordering is only about 1/4 of that of the unpreconditioned system, i.e., there is no asymptotic improvement no matter how small the grid size h becomes. Eijkhout [11] used an analysis of the infinity norm of the remainder matrix R=M−A to derive a criterion that places orderings for no-fill factorization into one of two categories, according to whether or not the ordering contains nodes that are eliminated before both of their neighbors in one direction.
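
Eijkhout's criterion can be checked directly from the elimination numbers of the grid nodes. The sketch below (our own illustration, with the assumed convention that missing boundary neighbors count as already eliminated) counts such nodes and shows that the row-wise ordering contains none while the red–black ordering contains many.

```python
import numpy as np

def count_incompatible_nodes(order):
    """Count nodes eliminated before both of their neighbours in one grid direction.
    `order` holds elimination numbers on an n-by-n grid."""
    n = order.shape[0]
    count = 0
    for i in range(n):
        for j in range(n):
            o = order[i, j]
            west = order[i, j - 1] if j > 0 else -1          # boundary: already eliminated
            east = order[i, j + 1] if j < n - 1 else -1
            south = order[i - 1, j] if i > 0 else -1
            north = order[i + 1, j] if i < n - 1 else -1
            if (o < west and o < east) or (o < south and o < north):
                count += 1
    return count

n = 8
natural = np.arange(n * n).reshape(n, n)                     # row-wise (natural) ordering
colour = np.add.outer(np.arange(n), np.arange(n)) % 2        # red-black colouring
rb = np.argsort(np.argsort(colour.ravel() * n * n + natural.ravel())).reshape(n, n)
print(count_incompatible_nodes(natural), count_incompatible_nodes(rb))   # 0 vs. many
```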

Doi and Lichnewsky [6] attempted to answer the first question, “why is there a trade-off?”, by utilizing graph representations of orderings. They showed that when an ordering graph contains an incompatible (local) graph, the components of the remainder matrix R corresponding to the incompatible graph are significantly larger than the components corresponding to any compatible graph. Furthermore, while components corresponding to a compatible graph converge to zero as the model equation parameters (e.g., grid aspect ratio, anisotropy) increase or decrease, components corresponding to an incompatible graph do not. Another interesting observation is that the components of r_k=b−Au_k (the residual vector after k iterations of a preconditioned iterative method) corresponding to incompatible local graphs have significantly larger values than those corresponding to compatible local graphs. An ordering which has no incompatible graphs is called compatible. Doi and Lichnewsky also confirmed that the trade-off observed by Duff and Meurant can be explained very well by means of the property of compatibility. With respect to parallel computing, an incompatible local graph corresponds to a potential starting point of the forward substitution with L. This may explain the trade-off; that is, an ordering with higher parallelism has more incompatible local graphs, which break the compatibility of the ordering and thus degrade the rate of convergence. It is also interesting that compatibility is equivalent to Eijkhout's criterion, although his study was performed independently. Doi and Lichnewsky [5] implemented several incomplete factorizations based on compatible orderings on a Cray-2 shared-memory vector multiprocessor using the nested dissection technique, and showed that these orderings can be mapped effectively onto vector multiprocessors. Doi [3], [4] made a more quantitative study of this trade-off problem using two parameterized orderings, parallel block PB(m) and parallel diagonal PD(m) (where m is a parameter that controls the degree of parallelism), and showed that the rate of convergence degrades rather smoothly as the degree of parallelism increases. A pragmatic solution designed to strike a better balance between parallelism and convergence is the use of 50 or more colors in multicolor ordering, rather than the conventionally used four or eight colors [4], [7], [13]. Such a large-numbered multicolor ordering technique attains high performance on an SX-3/14 vector supercomputer with only minor degradation in convergence [7].
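
One simple way to build a multicolor ordering with a tunable number of colors is a diagonal coloring, sketched below (this illustrates the idea only and is not necessarily the coloring used in the paper; the function name and color counts are ours).

```python
import numpy as np

def multicolour_ordering(n, nc):
    """Elimination numbers for an n-by-n grid under a diagonal multicolouring
    with nc colours: colour(i, j) = (i + j) mod nc.  Neighbours in the 5-point
    stencil differ by exactly one in i + j, so they never share a colour, and
    all nodes of one colour can be eliminated (and solved for) in parallel."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    colour = (i + j) % nc
    key = colour.ravel() * (n * n) + np.arange(n * n)   # colour-major numbering
    return np.argsort(np.argsort(key)).reshape(n, n)

order_2 = multicolour_ordering(199, 2)     # red-black: maximal parallelism
order_75 = multicolour_ordering(199, 75)   # large-numbered multicolouring (illustrative count)
```

With nc=2 this reduces to red–black ordering; as nc approaches 2n−1, each color shrinks to a single anti-diagonal and the ordering approaches a compatible hyperplane ordering, so the number of colors acts as the knob trading parallel vector length against the number of incompatible local graphs.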

Further techniques have also been proposed to compensate for the significantly larger components of R corresponding to the incompatible graphs that occasionally appear in large-numbered multicolor incomplete factorizations. Washio and Hayami [24], for example, introduced the overlapping technique, whose key idea is to compensate for those components by repeating parts of the forward and backward substitutions. It is also possible to apply the fill-in technique selectively to incompatible graphs. Since such graphs appear only rarely in a large-numbered multicolor ordering, this fill-in does not significantly reduce parallelism, and its cost is sufficiently low.
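
Where such selective fill-ins would be placed can be sketched as follows (our own illustration, not the paper's implementation): eliminating a node before both of its neighbors in one direction creates a level-1 fill-in between those two neighbors, so it suffices to admit fill-in only at the positions collected below; when incompatible nodes are rare, this list stays short.

```python
import numpy as np

def selective_fillin_pairs(order):
    """For each node eliminated before both of its neighbours in one grid direction
    (an incompatible local graph), return the pair of neighbours between which a
    level-1 fill-in would arise.  `order` holds elimination numbers on an n-by-n grid."""
    n = order.shape[0]
    pairs = []
    for i in range(n):
        for j in range(n):
            o = order[i, j]
            for a, b in (((i, j - 1), (i, j + 1)), ((i - 1, j), (i + 1, j))):
                inside = all(0 <= p < n and 0 <= q < n for p, q in (a, b))
                if inside and o < order[a] and o < order[b]:
                    pairs.append((a, b))    # allow fill-in between these two nodes only
    return pairs
```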

Section 2 of this paper introduces some fundamental definitions, including equivalence and compatibility of orderings. Section 3 presents an analysis that answers the question “why is there a trade-off?” Section 4 introduces ordering strategies and related techniques; these are, in a sense, answers to the question “how can we overcome the trade-off?” Section 5 presents the results of numerical experiments conducted on an SX-5 vector parallel supercomputer. Our results show that a large-numbered multicolor MILU–BiCGSTAB [23] method, combined with either overlapping or selective fill-ins, can solve some discrete convection–diffusion equations, discretized on a 199³ grid, in a time on the order of 1 s, with a sustained speed of about 40 Gflops on an SX-5/16A (16 CPUs, peak speed of 128 Gflops).

Section snippets

Preliminaries

This section gives definitions and theorems necessary to support the discussion in the sections which follow it. Readers are also advised to refer to [8], [9], which describe some notions and technical terms appearing here without definition.

Model problem

The discussion here is based on a model problem: a 5-point (or 7-point) finite difference discretization of a 2D (or 3D) convection–diffusion equation of the form −∇·(K∇u)+V·∇u=f, defined on a rectangular (or cubic) domain Ω. The elements of K (= diag[kx, ky, kz]) and V (= [vx, vy, vz]) are assumed to be constant in Ω. Dirichlet boundary conditions are imposed on the boundary ∂Ω.
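
For concreteness, a minimal sketch of assembling the 2D discretization is given below (the unit square, the uniform grid spacing, and central differencing of the convection term are our assumptions; the actual model parameters used in the experiments are listed in Appendix A).

```python
import scipy.sparse as sp

def convection_diffusion_matrix(n, kx=1.0, ky=1.0, vx=0.0, vy=0.0):
    """5-point finite difference matrix for -div(K grad u) + V . grad u on the
    unit square with n-by-n interior grid points and Dirichlet boundaries."""
    h = 1.0 / (n + 1)
    D2 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2    # -d2/dx2
    D1 = sp.diags([-1.0, 0.0, 1.0], [-1, 0, 1], shape=(n, n)) / (2 * h)  # d/dx, central
    I = sp.identity(n)
    A = (kx * sp.kron(I, D2) + ky * sp.kron(D2, I)      # diffusion terms
         + vx * sp.kron(I, D1) + vy * sp.kron(D1, I))   # convection terms
    return A.tocsr()

A = convection_diffusion_matrix(199)    # 199^2 unknowns for the 2D case
```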

Relation between ordering graph structure and the remainder matrix R

Since the rate of convergence of the ILU preconditioned iterative method is directly related to the norm of the remainder matrix R=M−A, it is

Design (implementation on vector multiprocessors)

The analysis presented in the previous section serves as the basis for a parallel ordering design that strikes a reasonable balance between parallelism and convergence, one which can be implemented efficiently on actual parallel computers.

Numerical experiments

This section reports the results of numerical experiments conducted on an SX-5/16A vector parallel supercomputer with 16 high-performance vector processors (128 Gflops peak speed). The experiments were applied to model problem (1) (the model parameters are given in Appendix A). A 199×199×199 grid was used, which produced nearly eight million unknowns.
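
As a small, self-contained stand-in for such a run (scipy's BiCGSTAB with its generic spilu incomplete factorization in place of the paper's parallel multicolor MILU preconditioner; the 2D grid, coefficients, and tolerances below are illustrative only):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# small 2D convection-diffusion test system (assumed coefficients)
n, h = 199, 1.0 / 200
D2 = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
D1 = sp.diags([-1.0, 0.0, 1.0], [-1, 0, 1], shape=(n, n)) / (2 * h)
I = sp.identity(n)
A = (sp.kron(I, D2) + sp.kron(D2, I)
     + 10.0 * sp.kron(I, D1) + 10.0 * sp.kron(D1, I)).tocsc()
b = np.ones(A.shape[0])

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)        # generic ILU, a stand-in for MILU
M = spla.LinearOperator(A.shape, matvec=ilu.solve)        # applies the preconditioner solve
x, info = spla.bicgstab(A, b, M=M)
print(info, np.linalg.norm(b - A @ x))                    # info == 0 signals convergence
```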

We used the BiCGSTAB method [23] as our basic iterative method, in which four types of preconditioning were applied: (1) hyperplane MILU, (2)

Concluding remarks

In this paper, we have discussed ordering strategies and related techniques for overcoming the trade-off between parallelism and convergence that is observed in incomplete factorizations. Graph representation of orderings is important because it gives a one-to-one correspondence between each graph and each set of equivalent orderings that yield the same convergence with preconditioned iterative methods, and with graph representations, the property of compatibility can be seen to

Acknowledgements

The authors would like to thank Mr. Akira Asami of NEC Informatec Systems for tuning our programs and for testing them on the SX-5 supercomputer.

References (25)

  • J.J. Dongarra et al., Solving linear systems on vector and shared memory computers (1990)

  • J.J. Dongarra et al., Numerical linear algebra for high-performance computers (1998)