Acceleration of boundary element method by explicit vectorization

https://doi.org/10.1016/j.advengsoft.2015.04.008

Highlights

  • The in-core vectorization of the Galerkin BEM using the Vc library is proposed.

  • Fully numerical and semi-analytical integration schemes are discussed.

  • Numerical experiments show significant speedup of the BEM computation.

Abstract

Although parallelization of computationally intensive algorithms has become standard in the scientific community, the possibility of in-core vectorization is often overlooked. With the development of modern HPC architectures, however, neglecting such programming techniques may lead to inefficient code that barely utilizes the theoretical performance of today's CPUs. The presented paper reports on explicit vectorization for quadratures stemming from the Galerkin formulation of boundary integral equations in 3D. To deal with the singular integral kernels, two common approaches are used: the semi-analytic and the fully numerical scheme. We exploit modern SIMD (Single Instruction Multiple Data) instruction sets to speed up the assembly of system matrices based on both of these regularization techniques. The efficiency of the code is further increased by standard shared-memory parallelization techniques and is demonstrated on a set of numerical experiments.

Introduction

The boundary element method (BEM) is a counterpart to the finite element method (FEM) suitable for the solution of partial differential equations which can be formulated in the form of boundary integral equations. Since BEM reduces the given problem to the boundary of a computational domain, it is especially suitable for problems stated in unbounded domains, such as acoustic or electromagnetic wave scattering, or shape optimization.

System matrices arising from the classical BEM are dense and the method has quadratic computational and memory complexity with respect to the number of surface elements. Moreover, special quadrature methods are needed due to the singularities in the kernels of the boundary integrals [1], [2], which further contribute to computational demands of the method. Several fast BEM approaches can be employed to reduce computational and memory requirements to almost linear. The common methods are based on the decomposition of the surface mesh into clusters and subsequent low rank approximation of matrix blocks corresponding to admissible pairs of clusters. Nonadmissible blocks are assembled in the standard way as full rank matrices. The fast multipole method (FMM) is based on the approximation of the system matrices by the multipole series expansion [3], [4], [5], whereas the adaptive cross approximation (ACA) assembles the low rank approximation from an algebraic point of view [6], [2].
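As a rough illustration of these complexity statements (the symbols N, m, n, r and the sample size are assumptions made here, not values from the paper), consider N surface elements with one degree of freedom each, and a matrix block of size m × n approximated with rank r:

```latex
\begin{align*}
  \text{dense matrix, double precision:} \quad
    & 8N^2 \ \text{bytes}, \qquad
      N = 10^5 \;\Rightarrow\; 8 \cdot 10^{10}\ \text{bytes} = 80\ \text{GB}, \\
  \text{full-rank block:} \quad
    & mn \ \text{entries}, \\
  \text{rank-}r \text{ block } UV^{T}: \quad
    & r(m+n) \ \text{entries}, \qquad
      r(m+n) \ll mn \ \text{ for } r \ll \min(m,n).
\end{align*}
```

For complex-valued kernels, such as those arising in wave scattering, each entry takes 16 bytes in double precision, which doubles the dense estimate.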

Regardless of the above-mentioned approximation techniques, there is still a need for an efficient assembly of nonadmissible matrix blocks or, in the case of ACA, of a certain number of rows and columns of admissible blocks. Since these blocks are usually too small to be distributed among computational nodes by MPI, an OpenMP parallelization of the assembly is an obvious choice. In this paper, we discuss further acceleration of the process by means of vectorization of the quadrature over pairs of surface elements.
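The shared-memory assembly of a small dense block can be sketched as follows. This is a minimal sketch, not code from the paper or from any BEM library: `integrate_pair` and `assemble_block` are hypothetical names, and the "quadrature" is a placeholder formula standing in for the integration of a singular kernel over a pair of elements.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the quadrature over one pair of surface elements.
// A real BEM code would integrate a (possibly singular) kernel here.
double integrate_pair(std::size_t i, std::size_t j) {
    const double d = static_cast<double>(i) - static_cast<double>(j);
    return 1.0 / (1.0 + std::fabs(d));
}

// Assemble a dense rows x cols block in row-major storage.
std::vector<double> assemble_block(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    std::vector<double> block(rows * cols);
    // All entries are independent, so the rows can be split among threads.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(rows); ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        for (std::size_t j = 0; j < cols; ++j) {
            block[row * cols + j] = integrate_pair(row, j);
        }
    }
    return block;
}
```

Calling `assemble_block(3, 4)` returns 12 entries; with GCC the threads are enabled by compiling with `-fopenmp`. The vectorization discussed in the paper acts one level below this loop, inside the quadrature for each pair.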

With new SIMD instruction sets available in modern processors, vectorization is becoming more important in scientific computing. Neglecting it may lead to inefficient code incapable of reaching the theoretical performance of current CPUs. The SSE instruction set introduced by Intel in 1999 provided eight 128-bit registers and enabled concurrent operations on four 32-bit single-precision floating point numbers. Its successors, SSE2–SSE4, extended this capability to support SIMD operations on two 64-bit double-precision floating point operands while incrementally adding more instructions. The AVX instruction set, supported by Intel processors since 2011, extends the register length from 128 bits to 256 bits and introduces a three-operand SIMD instruction format. Its capabilities are further extended by AVX2. AVX-512 should provide 512-bit registers allowing for concurrent operations on eight 64-bit double-precision numbers. Its support is announced for Intel’s Knights Landing processor available in 2015 and for Intel’s Skylake microprocessor architecture [7].

To use the vector instructions, the existing scalar code usually has to be modified. While the automatic loop vectorization provided by the compiler is not capable of vectorizing the more complex loops that often occur in scientific codes, exploiting the supported intrinsic functions directly may lead to confusing code that is hard to maintain. One way to avoid these issues is to use a higher level library, such as VML from Intel’s Math Kernel Library [8], VDT [9], or the Vc library [10], which is the main focus of this paper. The Vc library provides a high level wrapper around SIMD intrinsics and enables explicit vectorization of C++ code. It is portable among various compilers and SIMD instruction sets and enables easy vectorization without the need for a major redesign of the existing object oriented C++ code.
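The idea behind such a wrapper can be illustrated with a small stand-in type. The sketch below is an assumption made for illustration only: `Pack` and `euclidean_norm` are hypothetical names and this is not the Vc API. A real wrapper library maps the pack onto a hardware register and the operators onto SIMD instructions, whereas this version uses plain loops over the lanes.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical fixed-width pack of W values; NOT the Vc API.
template <typename T, std::size_t W>
struct Pack {
    std::array<T, W> lane{};

    // Fill every lane with the same value.
    static Pack broadcast(T value) {
        Pack p;
        p.lane.fill(value);
        return p;
    }

    // Checked read access to a single lane.
    T operator[](std::size_t i) const {
        assert(i < W);
        return lane[i];
    }
};

// Lane-wise addition.
template <typename T, std::size_t W>
Pack<T, W> operator+(const Pack<T, W>& a, const Pack<T, W>& b) {
    Pack<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Lane-wise multiplication.
template <typename T, std::size_t W>
Pack<T, W> operator*(const Pack<T, W>& a, const Pack<T, W>& b) {
    Pack<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Lane-wise square root, found by argument-dependent lookup.
template <typename T, std::size_t W>
Pack<T, W> sqrt(const Pack<T, W>& a) {
    Pack<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r.lane[i] = std::sqrt(a.lane[i]);
    return r;
}

// Generic kernel: the same source works for T = double (scalar)
// and for T = Pack<double, 4> (four values at once).
template <typename T>
T euclidean_norm(const T& dx, const T& dy, const T& dz) {
    using std::sqrt;
    return sqrt(dx * dx + dy * dy + dz * dz);
}
```

Because the kernel is written once against overloaded operators, switching from `double` to `Pack<double, 4>` changes only the type and leaves the formula untouched, which is what keeps this style maintainable compared with raw intrinsics.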

The topic of the vectorization of the BEM computation has been presented in several publications. In [11] an example of automatic loop vectorization of Fortran boundary element computation is provided. The original routines are manually altered using techniques such as loop unrolling and loop reordering in order to enable the compiler to employ SIMD instructions. Although a reasonable speedup with respect to the non-vectorized version is obtained, modifications lead to a significantly more complex code. The interested reader may also consult [12] for a comprehensive presentation of BEM quadrature vectorization. The author provides a general overview of the SIMD parallelism, compares two approaches to handling data during the computation (inter- and intra-register operations), and presents results of numerical experiments with a code vectorized using intrinsic functions. However, the work does not discuss the treatment of singularities in the related surface integrals, which is one of the crucial tasks of BEM computations.

The structure of the paper is as follows. In the next section we provide a model problem on which we demonstrate the boundary element workflow. Section 3 gives a short overview of our BEM library, and Section 4 discusses the vectorization of the computationally most demanding parts of the code. Finally, we provide results of numerical experiments and conclude.

Boundary element method for sound-hard scattering

In this section we present the model problem under consideration, derive the corresponding boundary integral equations and their Galerkin discretization.

The BEM4I library

The solver for the wave scattering problem based on BEM is implemented in the BEM4I library [17]. The library is written using C++ in an object-oriented way. It utilizes templates to support various indexing and scalar types. OpenMP is used for the parallelization in shared memory and some parts of the code are parallelized in distributed memory by MPI.

The structure of the solver is depicted in Figs. 1 and 2. Three main types of classes are responsible for the assembly of the system matrices.

Vectorization of the numerical quadrature

In the following section we demonstrate the applicability of the Vc library to the vectorization of the quadrature occurring in the boundary element matrices computation. The BEM4I library implements both the semi-analytic [2], [16] and numerical [1] methods, therefore the vectorized code is provided for both approaches.
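A fully numerical quadrature over a regular (non-singular) pair of elements can be arranged so that several quadrature points are processed per step. The sketch below is an illustration under stated assumptions, not the paper's implementation: `Points`, `kWidth`, and `quadrature_chunked` are hypothetical names, the Laplace kernel 1/(4π|x−y|) is used as a simple real-valued stand-in for a wave-scattering kernel, and the fixed-width inner loop only mimics what a SIMD vector type would do in a single instruction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quadrature points in structure-of-arrays layout, so that consecutive
// points are contiguous in memory, as vector loads require.
struct Points {
    std::vector<double> x, y, z, w;  // coordinates and quadrature weights
};

constexpr std::size_t kWidth = 4;  // assumed lane count (four doubles)

// Approximates the double integral of 1/(4*pi*|x-y|) over two
// well-separated elements, given their quadrature points and weights.
double quadrature_chunked(const Points& outer, const Points& inner) {
    assert(outer.x.size() == outer.w.size() && inner.x.size() == inner.w.size());
    assert(inner.y.size() == inner.w.size() && inner.z.size() == inner.w.size());
    const double pi = 3.14159265358979323846;
    const std::size_t n = inner.w.size();
    const std::size_t n_full = n - n % kWidth;  // part covered by full chunks
    double result = 0.0;
    for (std::size_t k = 0; k < outer.w.size(); ++k) {
        double acc[kWidth] = {};  // one partial sum per lane
        for (std::size_t l = 0; l < n_full; l += kWidth) {
            // Fixed trip count: this loop plays the role of one SIMD operation.
            for (std::size_t v = 0; v < kWidth; ++v) {
                const double dx = outer.x[k] - inner.x[l + v];
                const double dy = outer.y[k] - inner.y[l + v];
                const double dz = outer.z[k] - inner.z[l + v];
                acc[v] += inner.w[l + v] / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
        }
        double row = 0.0;
        for (std::size_t v = 0; v < kWidth; ++v) row += acc[v];  // horizontal sum
        for (std::size_t l = n_full; l < n; ++l) {  // scalar remainder
            const double dx = outer.x[k] - inner.x[l];
            const double dy = outer.y[k] - inner.y[l];
            const double dz = outer.z[k] - inner.z[l];
            row += inner.w[l] / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
        result += outer.w[k] * row;
    }
    return result / (4.0 * pi);
}
```

The sketch does not treat singular pairs (identical elements or elements sharing an edge or a vertex), which require the regularization schemes mentioned above before any such loop can be applied.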

The Vc library contains counterparts of most mathematical methods of the C++ Standard Library which makes it relatively easy to convert the quadrature code from a scalar to a

Numerical experiments

The following numerical experiments were carried out using one node of the Anselm cluster located at the IT4Innovations National Supercomputing Centre, Ostrava, Czech Republic. The node is equipped with two 8-core Intel Xeon E5-2665 2.4 GHz processors and 64 GB of RAM. The processor supports the SSE4.2 and AVX instruction set extensions. The tests were performed using the GCC 4.9.0 compiler. Since the 256-bit registers are only supported by a subset of the first generation AVX instructions [18],

Conclusion

In this work we have presented a new library of parallel solvers based on the boundary element method. In addition to the shared memory parallelization by OpenMP, the library features explicit vectorization of the semi-analytic and numerical quadrature of boundary integrals. The numerical experiments have demonstrated a relatively high efficiency of the vectorized code for both quadrature strategies. As our early experiments with AVX did not prove a significant improvement of the computational

Acknowledgements

This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, as well as by the Czech Ministry of Education, Youth and Sports via the project Large Research, Development and Innovations Infrastructures (LM2011033). The work was also supported by VŠB-TU Ostrava under the Grant SGS SP2015/160.
