Parallel SSOR preconditioning implemented on dynamic SMP clusters with communication on the fly

https://doi.org/10.1016/j.future.2009.05.005

Abstract

The paper presents a comparative analysis of parallel implementations of the preconditioned conjugate gradient method with the symmetric successive over-relaxation preconditioner. Two parallel implementations of the matrix solver are compared. The first is a message-passing version executed on a cluster of workstations. The second is an efficient version simulated on a novel architecture of dynamically reconfigurable shared memory clusters with a new paradigm of inter-processor communication called communication on the fly. The presented example shows the high suitability of the proposed architecture for fine-grain numerical computations. It can be very useful in the simulation of physical phenomena formulated as numerical problems suited to fine-grain parallel execution.

Introduction

The finite element (FE) technique is one of the most popular approximation methods in scientific computations and computer-aided engineering software. Different kinds and mixed formulations of the FE method are especially useful in computational electromagnetics (CEM) [1], [2]. The method replaces the original continuous problem, described by partial differential equations, with large linear or nonlinear matrix equations. An efficient formulation of the matrix subroutine is extremely important in time domain electromagnetic computations (TD-CEM). The computation of transient states requires step-by-step integration of the algebraic matrix equation $A e_\tau = b_\tau$ derived from Maxwell's equations [1], where $e_\tau$ denotes the computed distribution of the time-dependent electromagnetic field. The wide spectrum of electromagnetic phenomena and material properties results in a large diversity of forms of the matrix A. The properties of the matrix A depend on the type and order of the implemented elements, the quality of the mesh, the types of boundary conditions, and other factors. The efficient solution of CEM problems requires a smart solver: it should be flexible, portable to different problem formulations, and able to run on different hardware platforms.
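To make the step-by-step integration concrete, the following minimal C sketch shows the outer time loop that solves $A e_\tau = b_\tau$ at every time step. The CSR structure and the helpers assemble_rhs() and solve_linear_system() are illustrative assumptions for this sketch, not interfaces taken from the paper.

```c
/* Minimal sketch of the step-by-step time integration outlined above.
 * The CSR layout and the helpers assemble_rhs() / solve_linear_system()
 * are hypothetical names introduced only for this illustration. */
#include <stddef.h>

typedef struct {            /* sparse FE system matrix A in CSR format (assumed) */
    size_t n;               /* number of unknowns (FE degrees of freedom) */
    const size_t *row_ptr;  /* row start indices, length n+1 */
    const size_t *col_idx;  /* column indices of nonzeros */
    const double *val;      /* nonzero values */
} csr_matrix;

/* hypothetical helpers: build the right-hand side for time step tau
 * and solve A*e = b with an iterative (e.g. preconditioned CG) solver */
void assemble_rhs(double tau, const double *e_prev, double *b, size_t n);
void solve_linear_system(const csr_matrix *A, const double *b, double *e, size_t n);

void integrate(const csr_matrix *A, double *e, double *b, double dt, int nsteps)
{
    for (int k = 1; k <= nsteps; ++k) {
        double tau = k * dt;
        assemble_rhs(tau, e, b, A->n);         /* b_tau depends on the previous field */
        solve_linear_system(A, b, e, A->n);    /* e_tau := A^{-1} b_tau */
    }
}
```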

A preconditioned conjugate gradient (PCG) algorithm is one of the most suitable methods in CEM [3], [4]. The ill-conditioning of the matrix A results in a deterioration of the convergence of the CG solver. The performance of the computations can be improved by an appropriate and efficiently implemented preconditioner, $A e_\tau = b \;\xrightarrow{\text{preconditioning}}\; M A e_\tau = M b$, where M is a preconditioner matrix. The selection of the proper solver and/or preconditioner is not a trivial task. A useful criterion for a proper preconditioner (sequential or parallel) is the minimization of the total computation time of the matrix equation, see Eq. (2); in this way the improvement in convergence outweighs the computational cost of the preconditioning. There are no simple rules for the selection of a preconditioner, therefore an expert system seems to be the best tool. Some features of the computed problems (e.g. the structure of the matrix A, its norms, condition number and eigenvalues) can be analyzed using either artificial intelligence methods or heuristic algorithms. Such an environment and expert system for distributed platforms is being developed within the SANS project (Self-Adapting Numerical Software) [5]. It can be used as a user-friendly, transparent tool for the selection of the proper solver. A significant part of the project is a database in which information about solvers as well as features of the computed problems is collected.
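For reference, the listing below is a generic C sketch of the PCG iteration with the preconditioner applied through a callback (z := M⁻¹r); in this paper M is the SSOR preconditioner discussed later. The callback signatures and memory handling are assumptions made for this illustration, not the authors' implementation.

```c
/* Generic PCG sketch for A x = b with a symmetric positive definite A.
 * matvec() applies A, precond() applies the preconditioner (z := M^{-1} r).
 * The function-pointer interface is an assumption for this sketch. */
#include <math.h>
#include <stdlib.h>

typedef void (*apply_op)(const double *in, double *out, size_t n, void *ctx);

static double dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

/* Returns the number of iterations performed; x holds the initial guess on entry. */
int pcg(apply_op matvec, apply_op precond, void *ctx,
        const double *b, double *x, size_t n, double tol, int maxit)
{
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

    matvec(x, q, n, ctx);                        /* q = A x0 */
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - q[i];
    precond(r, z, n, ctx);                       /* z = M^{-1} r */
    for (size_t i = 0; i < n; ++i) p[i] = z[i];
    double rz = dot(r, z, n);

    int k;
    for (k = 0; k < maxit && sqrt(dot(r, r, n)) > tol; ++k) {
        matvec(p, q, n, ctx);                    /* q = A p */
        double alpha = rz / dot(p, q, n);
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        precond(r, z, n, ctx);                   /* z = M^{-1} r */
        double rz_new = dot(r, z, n);
        double beta = rz_new / rz;
        rz = rz_new;
        for (size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(q);
    return k;
}
```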

The next question is the efficient implementation of the solver [6], [7], [8]. Since the data and the tasks are partitioned in a parallel implementation, the specific properties of the applied hardware platform must be taken into account. The optimization of distributed preconditioning algorithms aims at decreasing the communication cost as well as improving the convergence of the iterative calculation. Besides universal methods derived from sequential versions [7], [9], the parallel formulations are developed with several specific techniques [4], [5], [10], [11], [12], [13] (a small partitioning sketch follows the list):

  1. Optimization of the partitioning scheme to obtain the desired locality of data and ideal parallelization of matrix processing. The partitioning techniques connected with the finite element method can be based on the geometry of the computed model and on physical constraints (e.g. boundary conditions in the model).

  2. Implementation of a partitioning paradigm and formulation of multilevel methods.

  3. Optimization of parallel processing of sparse matrices using direct algebraic transformations of the assembled matrices, e.g. block and block-striped mappings, red–black and multi-coloring schemes.

  4. Modification of the structure of tasks and of the communication pattern in the algorithm, e.g. overlapping and optimization of the relations between local/sequential data processing and data transfers, implementation of non-blocking data transfers, optimization of load balancing.

  5. Implementation and adjustment of heuristic modifications of the complete matrix equation (e.g. localized threshold value-based methods, lumping and collocation of the matrix).
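As an example of the block-striped mapping mentioned in item 3 (and announced before the list), the short C helper below computes the contiguous band of matrix rows owned by a given process. The balanced-remainder splitting is an assumption for this sketch, not a scheme prescribed by the paper.

```c
/* Illustrative block-striped row mapping: each of nproc processes owns a
 * contiguous band of rows of the assembled sparse matrix.  The balanced
 * handling of the remainder rows is an assumption made for this sketch. */
#include <stddef.h>

/* Returns the half-open row range [*first, *last) owned by process `rank`. */
void row_block(size_t nrows, int nproc, int rank, size_t *first, size_t *last)
{
    size_t base  = nrows / (size_t)nproc;   /* rows every process gets */
    size_t extra = nrows % (size_t)nproc;   /* first `extra` ranks get one more row */
    *first = (size_t)rank * base + ((size_t)rank < extra ? (size_t)rank : extra);
    *last  = *first + base + ((size_t)rank < extra ? 1u : 0u);
}
```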

The parallel formulation of the symmetric successive over-relaxation (SSOR) preconditioner is discussed in this paper. It is based on the complete form of the algorithm; no matrix reordering or problem modification is applied. This form of the algorithm is particularly hard to parallelize, because a triangular matrix equation must be solved in each iteration [9], [10]. The distributed version of SSOR loses some advantages of the sequential form. A two-step domain and task decomposition paradigm is used in the implemented SSOR subroutines. The applied two-dimensional matrix decomposition enables flexible modification of the communication pattern and tuning of the relation between data transfers and sequential computations. The granularity of the algorithm can be flexibly adjusted even during execution. In this way, the performance of this stage can be adapted to the currently available computational power and the specific properties of the communication network. From a general point of view, this form of the algorithm can be applied on any multicomputer/multiprocessor platform with distributed memory. A cluster of workstations (COW), as a common and popular testbed in practical simulations, was first used to check the performance of the proposed algorithm [14]. The algorithm is used to approximate the time-dependent distribution of a high-frequency electromagnetic wave in a benchmark model with a perfectly conducting sphere suspended in free space.

Next, the algorithm was ported onto a system of dynamic shared memory multiprocessor (SMP) clusters with communication on the fly [15]. This runtime-configurable multiprocessor platform constitutes a new class of hardware. Its architecture should be particularly helpful in computation- and data-transfer-intensive algorithms. The platform enables overlapping of data processing and data access commands. In this system, a new method for data exchange between processor clusters, communication on the fly, is applied. It combines porting data in the caches of processors switched between clusters with distributing the data in the target clusters by means of simultaneous reads into many processor data caches, performed by snooping the data exposed on the cluster shared memory bus. Such an architecture can be used to build a dynamically configurable parallel embedded subsystem aimed at speeding up time-consuming numerical computations, particularly fine-grain numerical simulation. Specific features of the platform coincide with the presented structure of the SSOR preconditioner and the PCG algorithm. Simulation experiments with the distributed SSOR preconditioner have confirmed a strong advantage of the proposed architecture over a classical cluster of workstations.
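The sequential C sketch below illustrates the three stages that the SSOR subroutines operate on, namely the forward solve, the diagonal scaling and the backward substitution, i.e. the application of $M^{-1}$ for the standard factored form $M = \frac{1}{\omega(2-\omega)}(D+\omega L)D^{-1}(D+\omega U)$. The CSR layout is an assumption, and the two-step parallel decomposition described in the paper is deliberately not reproduced here.

```c
/* Sequential sketch of one SSOR preconditioning step z := M^{-1} r for a
 * symmetric matrix A = L + D + U stored in CSR format (full pattern).
 * It only shows the forward solve, diagonal scaling and backward
 * substitution referred to in the text; the storage scheme is assumed. */
#include <stddef.h>

typedef struct {
    size_t n;
    const size_t *row_ptr, *col_idx;   /* CSR pattern */
    const double *val;                 /* nonzero values */
    const double *diag;                /* diag[i] = A(i,i) */
} csr_sym;

void ssor_apply(const csr_sym *A, double w, const double *r, double *z,
                double *u /* scratch, length n */)
{
    size_t n = A->n;

    /* forward solve: (D + w*L) u = r, rows processed top to bottom */
    for (size_t i = 0; i < n; ++i) {
        double s = r[i];
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            if (A->col_idx[k] < i) s -= w * A->val[k] * u[A->col_idx[k]];
        u[i] = s / A->diag[i];
    }

    /* diagonal scaling: z = w*(2-w) * D * u */
    for (size_t i = 0; i < n; ++i) z[i] = w * (2.0 - w) * A->diag[i] * u[i];

    /* backward substitution: (D + w*U) z = scaled rhs, rows bottom to top */
    for (size_t i = n; i-- > 0; ) {
        double s = z[i];
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            if (A->col_idx[k] > i) s -= w * A->val[k] * z[A->col_idx[k]];
        z[i] = s / A->diag[i];
    }
}
```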

Section snippets

Problem formulation

The presented formulation of the algorithm was validated using a benchmark problem derived from computational electromagnetics. A three-dimensional model with a perfectly conducting sphere placed in free space and illuminated by a propagating high-frequency (f = 2 GHz) sinusoidal plane wave is considered [16]. According to the mathematical background of the finite element method, the investigated model of time-dependent electromagnetic phenomena is translated into its discrete form (see Appendix A). The

Message-passing implementation

The construction and performance of the algorithm are modified by the row-wise domain decomposition of the matrices. The form and final properties of the forward calculation, the formulation of the diagonal matrix, and the backward substitution are changed in different ways. The matrix A is processed by columns in the forward calculation stage, while the backward substitution is performed by rows. The partial results of these stages, the vectors u and v, respectively, are placed in separate memory spaces of
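As a rough illustration of a column-oriented forward stage in a message-passing setting, the hedged MPI sketch below performs a fan-in forward substitution L u = r with one dense column block per process, reducing partial sums onto the owner of each row block. The dense storage, the one-block-per-process layout and the reuse of row_block() from the earlier listing are simplifying assumptions; the paper's two-dimensional decomposition and its actual communication pattern are not reproduced.

```c
/* Hedged sketch of a fan-in, column-distributed forward substitution L u = r.
 * Each rank stores its column block L(:, c0:c1) densely, row-major, and
 * computes the solution entries u(c0:c1) for the matching row range. */
#include <mpi.h>
#include <stdlib.h>

/* from the earlier partitioning sketch */
void row_block(size_t nrows, int nproc, int rank, size_t *first, size_t *last);

void forward_solve(const double *loc,   /* column block L(:, c0:c1), width = c1-c0 */
                   size_t n, size_t c0, size_t c1,
                   const double *r,     /* full right-hand side, length n */
                   double *u_loc,       /* out: u(c0:c1), length c1-c0 */
                   MPI_Comm comm)
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);
    size_t width = c1 - c0;
    double *acc = calloc(n, sizeof *acc);  /* contributions of the owned columns */
    double *rhs = calloc(n, sizeof *rhs);  /* reduced sums, used by the block owner */

    for (int b = 0; b < nproc; ++b) {
        size_t b0, b1;
        row_block(n, nproc, b, &b0, &b1);  /* rows solved by rank b (== its columns) */

        /* fan-in: collect contributions of earlier blocks onto the owner of block b */
        MPI_Reduce(acc + b0, rhs + b0, (int)(b1 - b0), MPI_DOUBLE, MPI_SUM, b, comm);

        if (rank == b) {
            /* local triangular solve of the diagonal block */
            for (size_t i = b0; i < b1; ++i) {
                double s = r[i] - rhs[i];
                for (size_t j = b0; j < i; ++j)
                    s -= loc[i * width + (j - c0)] * u_loc[j - c0];
                u_loc[i - c0] = s / loc[i * width + (i - c0)];
            }
            /* update the remaining rows with the now known local u entries */
            for (size_t i = b1; i < n; ++i)
                for (size_t j = c0; j < c1; ++j)
                    acc[i] += loc[i * width + (j - c0)] * u_loc[j - c0];
        }
    }
    free(acc);
    free(rhs);
}
```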

Implementation in shared memory dynamic clusters

The presented formulation of the SSOR preconditioner has been ported to a parallel system based on dynamic switching of processors between shared memory clusters and data reads on the fly inside clusters [15]. The elementary module of the system is composed of N processor nodes {PEi} and M memory modules {MEMj} shared by all processors (Fig. 3). Processor nodes can be dynamically connected to memory buses under the control of bus arbiters to create dynamic SMP clusters. The system can contain a

Conclusions

Parallel implementation of the PCG-SSOR algorithm in two execution environments, a classic cluster of workstations and a system of dynamic shared memory clusters with communication on the fly, has been discussed in the paper. A classic cluster of workstations with the MPI communication library appeared to be inadequate for parallelization of the SSOR preconditioner. The two-level task decomposition and the variable number of independently processed sub-matrices applied in the forward calculations and the


Boguslaw Butrylo is with Bialystok Technical University, Faculty of Electrical Engineering, Poland. He received his Ph.D. in electrical engineering from Bialystok Technical University in 2000. Since 2001 he has been an adjunct in the Department of Theoretical Electrotechnics and Metrology. His research interests include high performance modeling of electromagnetic and thermal phenomena and the optimization of equipment related to the distribution of electromagnetic and/or thermal fields.

Marek Tudruj is professor and head of the Chair of Parallel Computing and the Supercomputer Laboratory at the Polish-Japanese Institute of Information Technology in Warsaw; at the same time he is the head of the Computer Architecture Group in the Institute of Computer Science of the Polish Academy of Sciences in Warsaw. He obtained his Ph.D. and D.Sc. degrees from the Institute of Computer Science of the Polish Academy of Sciences in 1979 and 1994, respectively. He has been involved in many research projects concerned with parallel and distributed computer system architecture, the methodology of parallel and distributed system programming, and supporting tools for parallel and distributed program design. He is the author or co-author of about 130 research papers published in the proceedings of international conferences and in journals.

Lukasz Masko received the M.Sc. degree in informatics from the Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw in 2000, where he also obtained a Bachelor's degree in mathematics. Since 2000 he has been an assistant in the Institute of Computer Science of the Polish Academy of Sciences. His research interests include parallel processing, parallel systems architecture, and distributed and Grid computing. He also teaches computer science at the Polish-Japanese Institute of Information Technology.
