Computer Physics Communications
Volume 244, November 2019, Pages 13-24

Implementation of the moving particle semi-implicit method for free-surface flows on GPU clusters

https://doi.org/10.1016/j.cpc.2019.07.010

Abstract

The moving particle semi-implicit (MPS) method performs well in simulating incompressible flows with free surfaces. Despite its applicability, the MPS method suffers from a fundamental instability problem and a high computational cost in practical applications. Substantial research has been conducted on improving the stability and accuracy of the MPS method. Moreover, graphics processing units (GPUs), which are multi-processors that execute many three-dimensional geometric operations at high speed, provide unprecedented capability for scientific computation. However, a single GPU card is not sufficient for engineering applications that require several million particles to predict the desired physical processes, because its available memory space is insufficient. In this work, the dynamic stabilization (DS) algorithm and the particle shifting (PS) algorithm are used to overcome the instability and inaccuracy caused by tensile instability and non-uniform particle distributions, respectively. Based on the stabilized MPS method, a GPU-based MPS code that uses the compute unified device architecture (CUDA) language has been developed. An efficient neighborhood particle search is performed using an indirect method, and the matrix for the pressure Poisson equation (PPE) is assembled in parallel. Based on the single-GPU version, a multi-GPU MPS code has been developed. The approach uses a non-geometric dynamic domain decomposition method that provides homogeneous load balancing, whereby different portions (subdomains) of the physical system under study are assigned to different GPUs. Communication between devices is achieved with the message passing interface (MPI). Based on the neighborhood particle search, the techniques for building and updating the “halo” are described in detail. The speed-up of the single-GPU version is analyzed for different numbers of particles, and the scalability of the multi-GPU version is analyzed for different numbers of particles and GPUs. Lastly, an application with more than 10^7 particles is presented to demonstrate the capability of the code in handling large-scale simulations.

Introduction

The moving particle semi-implicit (MPS) method was proposed by Koshizuka [1] for incompressible flows. As a fully Lagrangian particle-based method, it has several advantages in the modeling of free-surface flows with large deformation, fragmentation and merging; solids with complex geometry; multiphase flows; and multi-physics problems. In free-surface flows, the MPS method reproduces well the complicated behavior of a changing water surface, including the fragmentation and coalescence of the water. In multiphase flows, the MPS method can capture or track the interface easily, because the interface is always clearly represented in the calculation.

However, the traditional MPS method has a fundamental instability problem that can considerably reduce the accuracy of the solution and even lead to failure of the simulation. One cause of this instability is tensile instability [2]. This problem is also very common in the SPH method and has been widely studied (e.g., [2], [3], [4]). To avoid tensile instability, Koshizuka et al. [5] modified the original MPS gradient model so that the pressure interaction forces would be purely repulsive, and Khayyer et al. [6] proposed the CMPS method. However, these approaches generally incorporate an artificial repulsive force that vanishes for a perfectly symmetric distribution of neighboring particles. Khayyer et al. [7] noted that this artificial repulsive force, rather than the original gradient, is predominant in these equations, and proposed a dynamic stabilization (DS) algorithm to overcome the tensile instability.

The non-uniform arrangement of particles over long simulation times is another potential cause of instability, as the traditional differential operator models are formulated under the assumption that particles are arranged uniformly. To overcome this, Khayyer and Gotoh [8] modified the Laplacian model in 2D and later extended it to 3D domains [9]. Tanaka and Masunaga [10] proposed a new formulation of the source term in the PPE for pressure. For the regularization of anisotropic particle distributions, a new method was proposed by Xu et al. [11] that slightly shifts particles (particle shifting, PS) across streamlines, thereby avoiding the extreme stretching and bunching of particles; the hydrodynamic variables are then adjusted through interpolation. A generalized version of this algorithm was proposed by Lind et al. [12], as sketched below. Improvements in the discretization schemes, for instance MLS-MPS [13], [14], have also been proposed recently. In the present study, we focus on improving the stability and accuracy of pressure calculations using the traditional discretization schemes. These improvements can be applied straightforwardly together with newly developed discretization schemes.
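To make the shifting step concrete, the following is a minimal CUDA sketch of a Fickian-type shift in the spirit of Lind et al. [12], dr_i = -D \nabla C_i, where the concentration gradient \nabla C_i is accumulated from kernel gradients. The CSR-style neighbor arrays, the coefficient D, and all names are assumptions for illustration, not the paper's implementation.

#include <cuda_runtime.h>
#include <math.h>

// Hypothetical particle-shifting (PS) step: move each particle slightly
// down the gradient of a kernel-based particle concentration.
__global__ void particleShift(const float3 *pos, float3 *posNew,
                              const int *nbrStart, const int *nbrList,
                              int nParticles, float re, float D)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    float gx = 0.f, gy = 0.f, gz = 0.f;
    for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {
        int j = nbrList[k];
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float r = sqrtf(dx * dx + dy * dy + dz * dz);
        if (r > 1e-12f && r < re) {
            // For the standard MPS kernel w = re/r - 1, the gradient of w
            // with respect to r_i points from i toward j with magnitude re/r^2.
            float c = re / (r * r);
            gx += c * dx / r;
            gy += c * dy / r;
            gz += c * dz / r;
        }
    }
    // Shift against the concentration gradient; D is a small,
    // resolution-dependent coefficient (assumed here).
    posNew[i] = make_float3(pos[i].x - D * gx,
                            pos[i].y - D * gy,
                            pos[i].z - D * gz);
}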

Surface detection is important for simulations of flows with free surfaces. The traditional MPS method uses a density threshold to differentiate the free-surface particles. Tanaka and Masunaga [15] used the number of neighboring particles instead of the particle number density (PND) for surface detection. However, these density-based detection methods are not accurate in simulations. Geometry-based detection methods, such as the arc method proposed by Dilts et al. [16], are considered more reliable. Tamai and Koshizuka [14] proposed a three-step surface detection method based on geometry and density. Shibata et al. [17] developed a new and accurate surface detection method that uses a virtual light source and a virtual screen. This simple method performed well in detecting the surface particles in their simulations of the hydrostatic pressure problem and the dam-break problem [17].

Another limitation to the practical use of particle methods is their high computational load. Various acceleration techniques have been employed to overcome this limitation. On the one hand, there are the traditional high-performance computing (HPC) techniques, which involve the use of hundreds or thousands of computing nodes, each of which hosts one or more central processing units (CPUs). Gotoh et al. [18] parallelized the three-dimensional corrected MPS method for general applications, with a simple dynamic domain decomposition for optimized load balancing among the processors. In Ref. [19], the MPS method was parallelized based on particle decomposition, and two techniques, namely renumbering and a communication list, were used to improve the communication speed. Fernandes et al. [20] proposed a strategy for parallelizing the MPS method for fully distributed computing on a cluster. The strategy was based on a non-geometric dynamic domain decomposition method and enabled the simulation of models with hundreds of millions of particles while keeping the required runtime within reasonable limits. In the explicit MPS method, the absence of a linear system to be solved gives more freedom in the design of a parallelization strategy and allows more of the calculation to be parallelized. Murotani et al. [21] used a two-level voxel-based domain decomposition to parallelize the explicit MPS method for CPU clusters, and tsunami simulations with models of up to 390 million particles were carried out. These results show the effectiveness of HPC in large-scale computations with particle methods. However, the main drawback of this type of approach is the enormous number of cores needed, which requires considerable investment in the purchase, maintenance and power supply of such equipment.

On the other hand, graphics processing units (GPUs), a new architecture for massively parallel processing, have recently become widespread in computer clusters. Although GPUs were developed exclusively for graphical purposes and driven by the computer game industry, the technology has evolved toward general-purpose GPUs (GPGPUs) since floating-point arithmetic on GPUs became possible and the compute unified device architecture (CUDA), along with its software development kit (SDK), was released. For example, despite the known challenges in accelerating linear solvers on GPUs, Hori et al. [22] developed a GPU-accelerated version of the standard MPS code. A two-dimensional calculation of elliptical drop evolution with 100k particles showed that the GPU-accelerated version was approximately 7 times faster than the code running on a single CPU core. Zhu et al. [23] developed a GPU-based MPS model with four optimization levels, in which the efficient neighborhood particle search is performed using an indirect method. A simulation of a benchmark dam-breaking flow showed that the GPU-based MPS model outperforms the traditional CPU model by a factor of 26. Kakuda et al. [24], [25] presented a GPU-based MPS method for calculating 2-D and 3-D dam-break problems; compared with CPU simulations, the speed-ups are 12 times and 17 times, respectively. Gou et al. [26] simulated isothermal multi-phase fuel-coolant interaction using the MPS method with GPU acceleration. These results show that speed-ups of about one order of magnitude can be achieved with the help of GPUs. Nevertheless, a single GPU card is not sufficient for engineering applications that require several million particles to predict the desired physical processes, as the available memory space is insufficient. It is essential to harness the performance of multiple GPUs for large simulations.

Substantial effort has been expended on fully explicit mesh-free methods on multiple GPUs. A multi-GPU parallelization scheme for explicit MPS simulation was presented in [27]. In that work, the physical system was divided into fixed subdomains based on a static partitioning, and the linear system for solving the pressure Poisson equation (PPE) was replaced by an equation of state [28] commonly used in the smoothed particle hydrodynamics (SPH) method [29]. The method then becomes fully explicit and well suited for computation on GPUs. Hashimoto et al. [30], [31] presented multi-GPU simulations on a GPU supercomputer for a realistic engineering problem. For SPH, starting from the GPU version of the SPH code DualSPHysics, Valdez-Balderas et al. [32] developed a multi-GPU SPH code for free-surface flows based on a spatial decomposition technique. An improved message passing interface (MPI) implementation, including dynamic load balancing and the overlapping of data communication and computation, was applied in [33], which utilizes MPI and CUDA to combine the power of different devices, thereby enabling the execution of SPH on heterogeneous clusters. In contrast, semi-implicit particle-based methods, such as the MPS and incompressible SPH (ISPH) methods, appear to be more stable than the fully explicit ones, thereby enabling relatively stable and accurate calculations [34], [35]. However, their semi-implicit algorithms make a multi-GPU MPS or ISPH calculation difficult to implement. To the best of the authors’ knowledge, no study on a multi-GPU application of the semi-implicit MPS method has been reported.
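For context on the fully explicit variants mentioned above: there, pressure follows directly from density via an equation of state, so no linear system needs to be solved. The standard weakly compressible choice [28] is Tait's equation (the specific constants used in [27] are not given in this excerpt):

p = \frac{\rho_0 c_0^{2}}{\gamma}\left[\left(\frac{\rho}{\rho_0}\right)^{\gamma} - 1\right],
\qquad \gamma = 7,

where \rho_0 is the reference density and c_0 an artificial speed of sound, typically chosen roughly ten times larger than the maximum flow speed so that density fluctuations stay near 1%.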

In this work, the DS algorithm and the particle shifting (PS) algorithm are used to overcome the instability and inaccuracy caused by tensile instability and non-uniform particle distributions, respectively, and the light source method is used for surface detection. To overcome the high computational cost, this work presents a multi-GPU MPS application. First, at a fine level, a single-GPU version of the MPS code is developed. Based on the single-GPU version, a multi-GPU MPS code is then developed at a coarse level by decomposing the physical system and assigning different subdomains to different GPUs. Communication between devices is achieved with MPI. With this development, we expect to be able to simulate many more particles and expand the possibilities for engineering applications of the MPS method: in most of the relevant phenomena involving free-surface flows, scales ranging from hundreds of meters down to centimeters are necessary to describe coastal hydrodynamics accurately. The spatial scales may span 4–5 orders of magnitude in coastal engineering; however, simultaneously tackling a large-scale simulation domain, high resolution, and a high level of incompressibility presents a computational challenge [36].

Section snippets

Governing equations

The governing equations for incompressible flows are the mass and momentum conservation equations:

\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0,
\qquad
\frac{D\mathbf{u}}{Dt} = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} + \mathbf{g},

where t is the time, \rho the fluid density, \mathbf{u} the velocity vector, p the pressure, \nu the kinematic viscosity, and \mathbf{g} the gravitational acceleration.

Derivative models

In the MPS method, particle interactions are evaluated using a kernel function w, which is a function of the distance |\mathbf{r}_j - \mathbf{r}_i| between particles and the effective radius r_e of the support domain:

w(|\mathbf{r}_j - \mathbf{r}_i|, r_e) =
\begin{cases}
\dfrac{r_e}{|\mathbf{r}_j - \mathbf{r}_i|} - 1, & 0 < |\mathbf{r}_j - \mathbf{r}_i| < r_e, \\
0, & |\mathbf{r}_j - \mathbf{r}_i| \ge r_e.
\end{cases}
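As a concrete illustration, the particle number density n_i = \sum_{j \ne i} w(|\mathbf{r}_j - \mathbf{r}_i|, r_e) built from this kernel can be evaluated per particle as in the following minimal CUDA sketch; the CSR-style neighbor arrays and all names are assumptions for illustration, not the paper's data structures.

#include <cuda_runtime.h>
#include <math.h>

// Standard MPS kernel: w(r, re) = re/r - 1 for 0 < r < re, else 0.
__device__ float mpsKernel(float r, float re)
{
    return (r > 0.f && r < re) ? (re / r - 1.f) : 0.f;
}

// Particle number density n_i = sum over neighbors j of w(|r_j - r_i|, re).
__global__ void computePND(const float3 *pos, float *pnd,
                           const int *nbrStart, const int *nbrList,
                           int nParticles, float re)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    float n = 0.f;
    for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {
        int j = nbrList[k];
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        n += mpsKernel(sqrtf(dx * dx + dy * dy + dz * dz), re);
    }
    pnd[i] = n;
}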

Single-GPU implementation of the MPS method
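Per the abstract, the single-GPU version performs the neighborhood particle search with an indirect method. A minimal CUDA sketch of a cell-linked-list style search under that reading is given below; the uniform-grid layout, the Thrust-based sort, and all names are illustrative assumptions, not necessarily the paper's implementation.

#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Hash each particle into a uniform grid whose cell size equals the
// effective radius re, so all neighbors of particle i lie in the 27
// cells surrounding its own.
__global__ void computeCellHash(const float3 *pos, unsigned int *cellHash,
                                unsigned int *particleIdx, int nParticles,
                                float3 origin, float cellSize, int3 nCells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int cx = (int)((pos[i].x - origin.x) / cellSize);
    int cy = (int)((pos[i].y - origin.y) / cellSize);
    int cz = (int)((pos[i].z - origin.z) / cellSize);
    cellHash[i] = (unsigned int)((cz * nCells.y + cy) * nCells.x + cx);
    particleIdx[i] = i;  // the "indirect" part: indices, not particle data
}

// Host side: sort particle indices by cell hash; a subsequent pass can then
// record where each cell's particles start and end in the sorted order.
void sortByCell(unsigned int *d_cellHash, unsigned int *d_particleIdx, int n)
{
    thrust::device_ptr<unsigned int> keys(d_cellHash);
    thrust::device_ptr<unsigned int> vals(d_particleIdx);
    thrust::sort_by_key(keys, keys + n, vals);
}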

Multi-GPU using MPI

As mentioned above, the number of particles that can be simulated is limited by the size of the GPU memory. To exceed this limit, a second level of parallelization based on the non-geometric dynamic domain decomposition method is applied using MPI, as explained in this section. Fig. 6 shows the computational flow chart of the multi-GPU MPS method.
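To make the communication step concrete, the following is a minimal host-side sketch of one halo exchange between two neighboring subdomains; the one-dimensional neighbor topology, the flat float buffers, and all names are illustrative assumptions (the paper's non-geometric dynamic decomposition is more general).

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Exchange halo particle data with neighboring ranks so each GPU can
// evaluate kernel sums for particles near its subdomain boundary.
void exchangeHalo(const float *d_sendBuf, float *d_recvBuf,
                  int nSend, int *nRecv,
                  int sendTo, int recvFrom, MPI_Comm comm)
{
    // Stage through host memory; a CUDA-aware MPI could instead pass the
    // device pointers to MPI_Sendrecv directly.
    float *h_send = (float *)malloc(nSend * sizeof(float));
    cudaMemcpy(h_send, d_sendBuf, nSend * sizeof(float),
               cudaMemcpyDeviceToHost);

    MPI_Status status;
    // First exchange the halo sizes, since they change every step as
    // particles move across subdomain boundaries.
    MPI_Sendrecv(&nSend, 1, MPI_INT, sendTo, 0,
                 nRecv, 1, MPI_INT, recvFrom, 0, comm, &status);

    float *h_recv = (float *)malloc(*nRecv * sizeof(float));
    MPI_Sendrecv(h_send, nSend, MPI_FLOAT, sendTo, 1,
                 h_recv, *nRecv, MPI_FLOAT, recvFrom, 1, comm, &status);

    cudaMemcpy(d_recvBuf, h_recv, *nRecv * sizeof(float),
               cudaMemcpyHostToDevice);
    free(h_send);
    free(h_recv);
}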

Computational environment

Simulations in this work are carried out on a single node hosting six GPUs at Zhejiang University, China. The GPUs are NVIDIA Tesla K80s, which feature a Kepler architecture and 2496 computing cores. These cores have a clock rate of 0.82 GHz and 12 GB of global memory. The connection to the host is via a PCI-Express bus. Additionally, the hosting node has two Intel Xeon E5-2650v3 processors with 10 cores each (2.30 GHz), which are connected by InfiniBand FDR 4x (56 Gbit/s and a

Conclusions and future work

In this work, the light source method is used to detect the free surface precisely.

A parallel implementation of the modified MPS method has been developed. The approach combines CUDA with MPI to harness the parallel performance of several GPUs in one host machine or in multiple machines connected by a network.

The single-GPU implementation presented here shows substantial acceleration. The speed-up of the explicit part of the calculation is more than 40 times. Due to the

References (40)

  • Khayyer, A. et al., J. Comput. Phys. (2011)
  • Tsuruta, N. et al., Comput. & Fluids (2013)
  • Khayyer, A. et al., Appl. Ocean Res. (2010)
  • Khayyer, A. et al., Appl. Ocean Res. (2012)
  • Lind, S.J. et al., J. Comput. Phys. (2012)
  • Tanaka, M. et al., J. Comput. Phys. (2010)
  • Hori, C. et al., Comput. Fluids (2011)
  • Monaghan, J.J., J. Comput. Phys. (1994)
  • Hashimoto, H. et al., Ocean Eng. (2017)
  • Valdez-Balderas, D. et al., J. Parallel Distrib. Comput. (2013)
  • Domínguez, J.M. et al., Comput. Phys. Comm. (2013)
  • Gotoh, H. et al., Coast. Eng. (2006)
  • Kleefsman, K.M.T. et al., J. Comput. Phys. (2005)
  • Koshizuka, S., Comput. Fluid Dyn. J. (1995)
  • Swegle, J.W., Hicks, D.L., Attaway, S.W., Academic Press Professional, Inc., 1995, pp. ...
  • Bonet, J. et al., Internat. J. Numer. Methods Engrg. (2010)
  • Koshizuka, S. et al., Internat. J. Numer. Methods Fluids (2015)
  • Khayyer, A. et al., Coast. Eng. J. (2008)
  • Kondo, M. et al., Internat. J. Numer. Methods Fluids (2011)
  • Xu, R., Stansby, P., Laurence, D., Academic Press Professional, Inc., 2009, pp. ...