Elsevier

Parallel Computing

Volume 57, September 2016, Pages 154-166

Implementation of the efficient communication layer for the highly parallel total FETI and hybrid total FETI solvers

https://doi.org/10.1016/j.parco.2016.05.002

Highlights

  • Implementation, performance, and scalability results of the communication layer for the Total FETI and Hybrid Total FETI solvers.

  • In HTFETI, several neighboring subdomains are aggregated into clusters. This reduces the size of the coarse problem and improves scalability.

  • Optimization of nearest neighbor communication - global gluing matrix.

  • Implementation of communication hiding and avoiding techniques inside the communication layer.

  • Benchmarks - elastic 3D cube up to 1.6 billion DOF and realistic car engine benchmark.

  • Large test executed with Total FETI to show the real potential of the communication layer on smaller clusters.

Abstract

This paper describes the implementation, performance, and scalability of our communication layer developed for Total FETI (TFETI) and Hybrid Total FETI (HTFETI) solvers. HTFETI is based on our variant of the Finite Element Tearing and Interconnecting (FETI) type domain decomposition method. In this approach, a small number of neighboring subdomains is aggregated into clusters, which results in a smaller coarse problem. To solve the original problem, the TFETI method is applied twice: first to the clusters and then to the subdomains in each cluster.

The current implementation of the solver is focused on the performance optimization of the main CG iteration loop, including: implementation of communication hiding and avoiding techniques for global communications; optimization of the nearest neighbor communication - multiplication with a global gluing matrix; and optimization of the parallel CG algorithm to iterate over local Lagrange multipliers only.

The performance is demonstrated on a linear elasticity 3D cube and real world benchmarks.
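The idea of iterating over local Lagrange multipliers only, combined with avoiding global communication, can be illustrated with a small simulation. This is a minimal sketch and not the authors' implementation: plain Python lists stand in for MPI ranks, the multiplier layout and all function names are hypothetical, and `allreduce_sum` is only a stand-in for a single global reduction. It assumes that each rank stores just the multipliers touching its subdomain, that a multiplier gluing two subdomains is stored on both ranks, and that shared entries are weighted by the inverse of their multiplicity so that a global dot product is not counted twice. Two dot products are fused into one reduction to reduce the number of global synchronizations.

```python
def multiplicity(local_vectors):
    """Count how many ranks store each global multiplier id."""
    count = {}
    for local in local_vectors:
        for gid in local:
            count[gid] = count.get(gid, 0) + 1
    return count


def fused_local_dots(pairs, count):
    """Weighted partial dot products computed on one rank.

    pairs is a list of (u, v) dictionaries keyed by global multiplier id.
    Entries shared with other ranks are divided by their multiplicity.
    """
    return [sum(u[g] * v[g] / count[g] for g in u) for u, v in pairs]


def allreduce_sum(partials_per_rank):
    """Stand-in for one global sum reduction over a short vector."""
    return [sum(column) for column in zip(*partials_per_rank)]


# Hypothetical layout: multiplier 1 glues the two subdomains and lives on both ranks.
lam = [{0: 1.0, 1: 2.0}, {1: 2.0, 2: 3.0}]
res = [{0: 0.5, 1: -1.0}, {1: -1.0, 2: 4.0}]

count = multiplicity(lam)
partials = [
    fused_local_dots([(lam[k], lam[k]), (lam[k], res[k])], count)
    for k in range(len(lam))
]
lam_dot_lam, lam_dot_res = allreduce_sum(partials)
```

In a real solver the single reduction would be a collective MPI call, and keeping shared entries consistent would require the nearest neighbor exchange mentioned above.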

Introduction

The goal of this paper is to describe the parallelization and optimization techniques and algorithms required to implement an efficient communication layer for parallel solvers based on the Finite Element Tearing and Interconnecting (FETI) method. An efficient communication layer is essential for good scalability in a cluster environment. It must be able to run on several thousands of MPI processes with minimal communication overhead. The method mainly used in this paper to evaluate the performance of the communication layer is Total FETI with one subdomain per MPI process and one MPI process per CPU core. This configuration uses a high number of MPI ranks to solve a problem and therefore relies heavily on the communication layer. This paper also introduces the Hybrid Total FETI (HTFETI) method, which can process several hundreds of small subdomains per MPI process and run efficiently with only one MPI process per node. This means that if the TFETI method runs on 400 nodes with 20 MPI processes per node (8000 MPI processes in total), HTFETI with the same number of MPI processes will run on 8000 compute nodes.

The HTFETI method is based on our variant of the FETI type domain decomposition method called Total FETI (TFETI) [6]. The original FETI method, also called the FETI-1 method, was introduced by Farhat and Roux [2] for the numerical solution of the large linear systems arising in linearized engineering problems. In the FETI methods, a body is decomposed into several non-overlapping subdomains and the continuity between the subdomains is enforced by Lagrange multipliers. Using the theory of duality, a smaller and relatively well conditioned dual problem can be derived and efficiently solved by a suitable variant of the conjugate gradient algorithm.

The original FETI algorithm, which relied only on the favorable distribution of the spectrum of the dual Schur complement matrix [3], was efficient only for a small number of subdomains. It was therefore later extended by introducing a natural coarse problem [4], [5], whose solution was implemented by auxiliary projectors so that the resulting algorithm became in a sense optimal.
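For orientation, the dual problem and the projector mentioned above can be written in a common textbook form. The notation below is an assumption made for this sketch, since this excerpt does not reproduce the paper's own matrix formulation, and sign and transpose conventions differ between FETI papers. Here $K$ is the block-diagonal stiffness matrix, $B$ the gluing matrix, $K^{+}$ a generalized inverse of $K$, and the columns of $R$ span the kernel of $K$.

```latex
\begin{align*}
  &\min_{u}\ \tfrac{1}{2}\, u^T K u - f^T u
     \quad \text{subject to} \quad B u = c, \\
  &K u + B^T \lambda = f, \qquad B u = c, \\
  &u = K^{+}\,(f - B^T \lambda) + R \alpha,
     \qquad R^T (f - B^T \lambda) = 0, \\
  &F = B K^{+} B^T, \quad G = R^T B^T, \quad
     d = B K^{+} f - c, \quad e = R^T f, \\
  &F \lambda - G^T \alpha = d, \qquad G \lambda = e, \\
  &P = I - G^T (G G^T)^{-1} G, \qquad
     \lambda = \lambda_{\mathrm{Im}} + \lambda_{\mathrm{Ker}}, \qquad
     \lambda_{\mathrm{Im}} = G^T (G G^T)^{-1} e, \\
  &P F P\, \lambda_{\mathrm{Ker}} = P\,(d - F \lambda_{\mathrm{Im}}), \qquad
     \alpha = (G G^T)^{-1} G\,(F \lambda - d).
\end{align*}
```

In this form the coarse problem is the system with the matrix $G G^T$. Its dimension equals the total number of kernel vectors, so it grows with the number of subdomains.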

In the TFETI method [6], the Dirichlet boundary conditions are also enforced by Lagrange multipliers. Hence all subdomain stiffness matrices are singular with a priori known kernels, which is a great advantage in the numerical solution. With the known kernel basis, we can effectively regularize the local stiffness matrix [10] and use any standard Cholesky type decomposition method for nonsingular matrices.
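The benefit of a known kernel can be shown on a toy problem. The sketch below is an illustration under stated assumptions and not the regularization of [10]: it uses a free chain of unit springs in 1D, whose kernel is the constant vector, and the simple regularization $K + \rho R R^T$ with a normalized kernel vector $R$. All function names are hypothetical. For a right-hand side orthogonal to the kernel, the solution of the regularized system also solves the original singular system.

```python
import math


def free_chain_stiffness(n):
    """Stiffness matrix of n nodes joined by unit springs, no Dirichlet condition."""
    K = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        K[i][i] += 1.0
        K[i + 1][i + 1] += 1.0
        K[i][i + 1] -= 1.0
        K[i + 1][i] -= 1.0
    return K


def cholesky(A, tol=1e-12):
    """Dense Cholesky factor L with A = L L^T; raises ValueError if a pivot is not positive."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot <= tol:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def cholesky_solve(L, b):
    """Solve L L^T x = b by forward and back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def regularize(K, R, rho=1.0):
    """Return K + rho * R R^T for a single normalized kernel vector R."""
    n = len(K)
    return [[K[i][j] + rho * R[i] * R[j] for j in range(n)] for i in range(n)]


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


n = 4
K = free_chain_stiffness(n)
R = [1.0 / math.sqrt(n)] * n      # known kernel: rigid translation of the chain
f = [1.0, 0.0, 0.0, -1.0]         # self-equilibrated load, orthogonal to R

try:
    cholesky(K)
    singular_detected = False
except ValueError:
    singular_detected = True      # plain Cholesky breaks down on the singular K

L = cholesky(regularize(K, R))
u = cholesky_solve(L, f)
residual = max(abs(a - b) for a, b in zip(matvec(K, u), f))
```

In 3D linear elasticity the kernel of a floating subdomain is spanned by six rigid body modes instead of one constant vector, but the same reasoning applies.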

Even though there are several efficient coarse problem parallelization strategies [7], the size of the coarse problem remains a limitation. Several hybrid (multilevel) methods were therefore proposed [8], [9]. The key idea is to aggregate a small number of neighboring subdomains into clusters, which naturally results in a smaller coarse problem. In our HTFETI, the aggregation of subdomains into clusters is again enforced by Lagrange multipliers. Thus the TFETI method is used on both the cluster and subdomain levels. This approach allows parallelization of the original problem up to tens of thousands of cores, which is not reachable with standard FETI methods because of difficulties with the large coarse problem. However, the HTFETI method converges more slowly, so it needs more iterations than the TFETI method. For smaller problems, TFETI therefore remains the more efficient and recommended method. Our ultimate goal, however, is to compute extremely large problems decomposed into so many subdomains that they are not solvable by the standard FETI methods.
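The effect of clustering on the coarse problem can be estimated with simple arithmetic. The sketch below assumes 3D linear elasticity, where each floating unit contributes six rigid body modes, and assumes that in HTFETI the global coarse problem sees one set of rigid body modes per cluster instead of one per subdomain. The subdomain count and the cluster size are hypothetical values chosen only for illustration, and the function names are not taken from the paper.

```python
import math

RIGID_BODY_MODES_3D = 6  # three translations and three rotations


def coarse_problem_dim(n_floating_units, kernel_dim=RIGID_BODY_MODES_3D):
    """Dimension of the coarse problem: one kernel basis per floating unit."""
    return n_floating_units * kernel_dim


def number_of_clusters(n_subdomains, subdomains_per_cluster):
    """Clusters needed when neighboring subdomains are aggregated in fixed-size groups."""
    return math.ceil(n_subdomains / subdomains_per_cluster)


n_subdomains = 8000              # hypothetical decomposition
subdomains_per_cluster = 100     # hypothetical cluster size

tfeti_dim = coarse_problem_dim(n_subdomains)
n_clusters = number_of_clusters(n_subdomains, subdomains_per_cluster)
htfeti_dim = coarse_problem_dim(n_clusters)
reduction = tfeti_dim // htfeti_dim
```

Under these assumptions the coarse problem shrinks by a factor equal to the cluster size. The price is the slower convergence described above.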

Section snippets

Matrix formulation

In this paper, we use the notation introduced in [1]. Let us consider a model problem from linear elasticity. The isotropic elastic body occupies a domain $\Omega \subset \mathbb{R}^d$, $d = 2, 3$, with sufficiently smooth boundary Γ. To apply the HTFETI approach to such a problem, we first tear the body from the part of the boundary with the Dirichlet boundary condition, as in the TFETI approach. Then we decompose the body into non-overlapping clusters and the clusters into non-overlapping subdomains. Finally, we

Total FETI and hybrid total FETI solver

Our TFETI and HTFETI solver is implemented in pure C++. A significant part of the development effort was devoted to writing C++ wrappers for (1) the selected sparse and dense BLAS routines and (2) the sparse direct solver (the MKL version of PARDISO [12]) of the Intel MKL library [11].

Since the solver development is mainly focused on current and future multi-core and many-core architectures, in particular the Intel MIC architecture, the Intel MKL library is the only external

Numerical experiments

The described algorithms were implemented in our new library, developed in C++, and tested on the solution of 2D and 3D linear elasticity problems. We varied the decomposition and discretization parameters in order to demonstrate the scalability of our method.

The benchmarks were executed on two European supercomputers: (1) Anselm, located at IT4Innovations in the Czech Republic, and (2) Cartesius, located at SurfSara in the Netherlands. The machines have the following parameters

Conclusion

The current implementation of the solver is primarily focused on the performance optimization of the main CG solver iteration loop, including: implementation of communication hiding and avoiding techniques for global communications; optimization of the nearest neighbor communication - multiplication with a global gluing matrix; optimization of the parallel CG algorithm to iterate only over local Lagrange multipliers. In other words, we focused on the development of the highly scalable FETI

Acknowledgment

This paper has been elaborated in the framework of the project called New creative teams in priorities of scientific research, reg. no. CZ.1.07/2.3.00/30.0055, supported by Operational Programme Education for Competitiveness and co-financed by the European Social Fund and the state budget of the Czech Republic and by Grant Agency of the Czech Republic GAČR grant 13-30657P.

This work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU

References (12)

  • C. Farhat et al., Optimal convergence properties of the FETI domain decomposition method, Comput. Methods Appl. Mech. Eng. (1994)

  • M. Jarošová et al., Hybrid total FETI method (2012)

  • C. Farhat et al., An unconventional domain decomposition method for an efficient parallel solution of large-scale finite element systems, SIAM J. Sci. Stat. Comput. (1992)

  • F.-X. Roux, Spectral analysis of interface operator

  • F.-X. Roux et al., Parallel implementation of direct solution strategies for the coarse grid solvers in 2-level FETI method, Contemporary Math. (1998)

  • Z. Dostál et al., Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE, Commun. Numer. Methods Eng. (2006)

There are more references available in the full text version of this article.
