
Computer Physics Communications

Volume 207, October 2016, Pages 114-122

GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid

https://doi.org/10.1016/j.cpc.2016.05.018

Abstract

A GPU accelerated inviscid flow solver is developed on an unstructured quadrilateral grid in the present work. For the first time, cell-based adaptive mesh refinement (AMR) is fully implemented on the GPU for an unstructured quadrilateral grid, which greatly reduces the frequency of data exchange between the GPU and the CPU. Specifically, the list operations of the AMR are parallelized with atomic operations, and null memory recycling is realized to improve the efficiency of memory utilization. The results obtained on GPUs agree very well with the exact or experimental results in the literature. An acceleration ratio of 4 is obtained between the parallel code running on the older GT9800 GPU and the serial code running on an E3-1230 V2 CPU. With the optimizations of configuring a larger L1 cache and adopting shared-memory-based atomic operations on the newer C2050 GPU, an acceleration ratio of 20 is achieved. The parallelized cell-based AMR procedures achieve a 2× speedup on the GT9800 and 18× on the Tesla C2050, which demonstrates that running the cell-based AMR method in parallel on the GPU is feasible and efficient. Our results also indicate that recent developments in GPU architecture benefit fluid dynamics computing significantly.

Introduction

In recent years, the GPU (Graphics Processing Unit), once used only for graphics processing, has been extended to general-purpose computing because of its high computing power and memory bandwidth. Some early studies adopted graphics programming languages such as Cg and OpenGL to accelerate particle algorithms [1], [2], [3]. These works showed the great potential of using GPUs for scientific computing, but coding with these languages was difficult and the range of applications was limited. Nevertheless, the development of general-purpose computing on GPUs never stopped. In 2007 NVIDIA released its parallel computing model CUDA (Compute Unified Device Architecture), which provides an easy-to-use tool for scientific computing and is now widely used in many fields. Many researchers have applied it to Computational Fluid Dynamics (CFD) and obtained remarkable performance improvements. Thibault et al. developed an incompressible Navier–Stokes solver on multiple GPUs with a second-order accurate central difference scheme and achieved a 100× speedup [4]. Bailey and co-workers used CUDA to accelerate the Lattice Boltzmann Method on the GPU and obtained a remarkable performance enhancement [5]. Frezzotti’s group adopted semi-regular methods to solve the Boltzmann equation on GPUs with high efficiency [6]. Ran et al. realized a GPU accelerated CESE method for 1D shock tube problems and achieved high acceleration ratios [7]. Brodtkorb et al. implemented shallow water simulations on GPUs and performed a detailed analysis of them [8]. Lutsyshyn presented a scheme for parallelizing the quantum Monte Carlo method on the GPU and benchmarked the program on several models of NVIDIA GPUs [9].

The efficiency of implementing a CFD method on the GPU depends strongly on the mesh type. Compared with their structured counterparts, methods based on unstructured meshes are harder to accelerate on the GPU because the unstructured connectivity leads to non-coalesced memory accesses. Several researchers have made efforts to overcome this difficulty. Corrigan et al. implemented an unstructured-grid Euler solver on the GPU and obtained a speedup of 33× [10]. Kampolis et al. accomplished a GPU accelerated Navier–Stokes solver on unstructured grids in the same year [11] and achieved a remarkable increase in computing performance. Waltz described the performance of CHICOMA, a 3D unstructured-mesh compressible flow solver, on the GPU and observed speedups of 4–5× over single-CPU performance [12]. Lani et al. provided a GPU-enabled finite volume solver for ideal magnetohydrodynamics on unstructured grids within the COOLFluiD platform [13]. Almost all of these authors employed a renumbering technique to cope with the problem of non-coalesced memory accesses, which is discussed in detail in [14]. As demonstrated in their works, the renumbering technique allows shared memory to be introduced and therefore improves the codes’ performance efficiently.

In fact, the renumbering technique is suitable for static unstructured meshes and may fail for dynamic ones. A dynamic unstructured mesh can be generated by adaptive mesh refinement (AMR), one of the most important approaches in CFD. By refining only the coarse cells where the truncation error is large, the method solves the conservation equations with far less computing resources than a uniformly fine mesh would require. However, the adaptive mesh is complicated and dynamic and is not easy to parallelize, especially on the GPU. Wang and his team [15], as well as the group led by Hsi-Yu Schive [16], have implemented solvers on structured meshes with the AMR method, but porting the mesh adaptation part to the GPU was avoided in their implementations. In the cell-based AMR method, if mesh adaptation is processed on the CPU, data is exchanged frequently between the CPU and the GPU, which introduces a bottleneck in the code’s overall performance. Removing this bottleneck requires a parallel mesh adaptation algorithm on the GPU, which motivates the current work: we attempt to implement such a solver with cell-based AMR entirely on the GPU.

The rest of this paper is organized as follows: Section 2 provides a brief introduction to the numerical method and the cell-based AMR method used in this work. In Section 3, the implementation of the method on the GPU is described in detail. The numerical results and the solver performance on the GPU are discussed in Section 4. Finally, conclusions are drawn.

Section snippets

Numerical method

Consider the two-dimensional Euler equations for an inviscid, compressible flow:
\[ U_t + F(U)_x + G(U)_y = 0, \]
where $U$, $F$ and $G$ are defined as
\[ U = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ E \end{bmatrix}, \quad F = \begin{bmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ u(E+p) \end{bmatrix}, \quad G = \begin{bmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ v(E+p) \end{bmatrix}, \]
and
\[ E = \rho\left[\tfrac{1}{2}(u^2+v^2) + e\right], \qquad e = \frac{p}{\rho(\gamma-1)}, \]
where $\rho$, $u$, $v$, $p$, $E$, $e$ and $\gamma$ denote the density, the $x$- and $y$-velocity components, the pressure, the total energy per unit volume, the specific internal energy and the ratio of specific heats, respectively.
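To make the notation concrete, the following sketch shows how the flux vectors $F(U)$ and $G(U)$ can be evaluated from a conservative state in CUDA C, the language used for the codes in this work. It is illustrative only and not taken from the solver described in the paper; the struct layout, the function name and the assumed ratio of specific heats GAMMA = 1.4 are ours.

// Minimal sketch (not the paper's code): evaluating the 2D Euler flux vectors
// F(U) and G(U) from a conservative state U = (rho, rho*u, rho*v, E).
// A calorically perfect gas with GAMMA = 1.4 is assumed; all names are illustrative.
#define GAMMA 1.4f

struct State { float rho, rhou, rhov, E; };

__host__ __device__ void euler_flux(const State& U, State& F, State& G)
{
    float u = U.rhou / U.rho;
    float v = U.rhov / U.rho;
    // p = (gamma - 1) * (E - 0.5 * rho * (u^2 + v^2)), from E = rho*(0.5*(u^2+v^2) + e)
    float p = (GAMMA - 1.0f) * (U.E - 0.5f * U.rho * (u * u + v * v));

    F.rho  = U.rhou;            G.rho  = U.rhov;
    F.rhou = U.rhou * u + p;    G.rhou = U.rhov * u;
    F.rhov = U.rhou * v;        G.rhov = U.rhov * v + p;
    F.E    = u * (U.E + p);     G.E    = v * (U.E + p);
}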

The numerical method in this study is based on VAS2D, developed by Sun and Takayama [17], [18]. An adaptive unstructured

GPU implementation

Because data exchange between the GPU and the CPU is expensive, as a golden rule it should be avoided as far as possible. Thus, to achieve higher computing performance, the solver in this study is designed to run entirely on the GPU. The main computational procedure is sketched in Fig. 2. First, during data initialization, the program allocates a large block of memory, reads the mesh data, initializes the flow field and carries out the initial adaptation. Then the initialized data is sent to
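The list operations of the AMR mentioned in the abstract are parallelized with atomic operations. Although the paper's source code is not reproduced here, a minimal sketch of such an operation, under our own assumptions about data layout and naming, is a kernel in which each thread inspects one cell and, if the cell is flagged for adaptation, reserves a unique slot in a compact work list via atomicAdd:

// Sketch only (names and data layout assumed, not the paper's code):
// building a compact list of cells flagged for refinement without any
// CPU involvement, using a single global atomic counter.
__global__ void gather_flagged_cells(const int* refine_flag,  // 1 if the cell must be refined
                                     int num_cells,
                                     int* refine_list,        // output: compacted cell ids
                                     int* refine_count)       // output: number of flagged cells
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < num_cells && refine_flag[cell]) {
        int slot = atomicAdd(refine_count, 1);  // reserve a unique slot in the list
        refine_list[slot] = cell;
    }
}

The shared-memory-based atomic optimization reported for the C2050 can be read as reducing contention on such global counters, for example by accumulating a per-block count in shared memory and issuing one global atomicAdd per block instead of one per flagged cell; this reading is our interpretation, not a statement of the paper's exact implementation.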

Results and analysis

In this section, the simulation results of a shock diffraction problem are presented to verify the two GPU codes (the primary code running on the GT9800 and the C2050, and the optimized code on the C2050) and to analyze their performance. All codes are written in CUDA C with CUDA 5, and the simulations are computed in single-precision floating-point format.

Conclusions

The cell-based AMR on an unstructured quadrilateral mesh is realized on the GPU in this study. Specifically, we implemented and optimized the well-validated numerical method VAS2D on the GPU: null memory recycling is added to improve memory utilization, and list processing is parallelized on the GPU with low-frequency atomic operations. In this way, we have taken one step further toward realizing AMR on the GPU. Our work is, to the best of our knowledge, the first unstructured cell-based algorithm that
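As an illustration of what null memory recycling might look like on the device, slots vacated by coarsened cells can be kept in a device-side free list maintained with atomic operations and reused when refinement creates new cells, so the allocated memory block does not grow unnecessarily. This is a hedged sketch under our own assumptions about names and data layout, not the paper's implementation:

// Sketch (assumed names/layout, not the paper's code) of null memory recycling:
// cell slots released during coarsening are pushed onto a device-side free list
// and popped again when refinement needs storage for new cells. The sketch
// assumes the push and pop phases run in separate kernel launches, so they never
// race with each other; if the pop phase drives free_top negative, the free list
// is exhausted and free_top must be reset to zero before the next adaptation pass.
__global__ void release_cell_slots(const int* coarsened_cells, int num_coarsened,
                                   int* free_slots, int* free_top)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_coarsened) {
        int pos = atomicAdd(free_top, 1);      // push: reserve a position on the stack
        free_slots[pos] = coarsened_cells[i];  // this slot may now hold a new cell
    }
}

__device__ int acquire_cell_slot(int* free_slots, int* free_top, int* num_cells)
{
    int pos = atomicSub(free_top, 1) - 1;      // pop: try to reuse a recycled slot
    if (pos >= 0) return free_slots[pos];
    return atomicAdd(num_cells, 1);            // free list empty: append at the end
}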

Acknowledgment

This research was carried out with the support of the National Natural Science Foundation of China under Grants 11172292 and 11272310.

References (24)

  • J. Yang et al., J. Comput. Phys. (2007)
  • A. Frezzotti et al., Comput. Phys. Comm. (2011)
  • W. Ran et al., J. Comput. Phys. (2011)
  • A.R. Brodtkorb et al., Comput. & Fluids (2012)
  • Y. Lutsyshyn, Comput. Phys. Comm. (2015)
  • I. Kampolis et al., Comput. Methods Appl. Mech. Engrg. (2010)
  • A. Lani et al., Comput. Phys. Comm. (2014)
  • P. Wang et al., New Astron. (2010)
  • M. Sun et al., J. Comput. Phys. (1999)
  • M. Sun et al., J. Comput. Phys. (2003)
  • G.V. Nivarti et al., J. Comput. Phys. (2015)
  • W. Li et al., Vis. Comput. (2003)