GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid
Introduction
In recent years, the GPU (Graphics Processing Unit), once used only for graphics processing, has been extended to general-purpose computing because of its high computing power and bandwidth. Some early studies adopted graphics programming tools such as Cg and OpenGL to accelerate particle algorithms [1], [2], [3]. These works showed the great potential of GPUs for scientific computing, but coding scientific applications with these tools was difficult and the range of applications was limited. In 2007, NVIDIA Corporation released CUDA (Compute Unified Device Architecture), a parallel computing model for general-purpose computing that provides an easy-to-use tool for scientific computing and is now widely used in many fields. Many researchers have applied it to Computational Fluid Dynamics (CFD) and obtained remarkable performance gains. Thibault et al. developed a multi-GPU Navier–Stokes solver for incompressible flow with a second-order accurate central difference scheme and achieved a speedup [4]. Bailey and his co-workers used CUDA to accelerate the Lattice Boltzmann Method on GPU and obtained a remarkable performance enhancement [5]. Frezzotti’s group adopted semi-regular methods to solve the Boltzmann equation on GPUs with high efficiency [6]. Ran et al. realized a GPU-accelerated CESE method for 1D shock tube problems and achieved high acceleration ratios [7]. Brodtkorb et al. implemented shallow water simulations on GPUs and analyzed them in detail [8]. Lutsyshyn presented a scheme for parallelizing the quantum Monte Carlo method on GPU and benchmarked the program on several models of NVIDIA GPUs [9].
Implementing a CFD method on GPU depends greatly on the mesh type used. Compared with their structured counterparts, methods based on unstructured meshes are difficult to accelerate efficiently on GPU because the unstructured configuration leads to non-coalesced memory access. Several researchers have worked to overcome this difficulty. Corrigan et al. implemented an unstructured-grid Euler solver on GPU and obtained a speedup [10]. Kampolis et al. accomplished a GPU-accelerated Navier–Stokes solver on unstructured grids in the same year [11] and achieved a remarkable increase in computing performance. Waltz described the performance of CHICOMA, a 3D unstructured mesh compressible flow solver, on GPU and observed a speedup over single-CPU performance [12]. Lani et al. provided a GPU-enabled finite volume solver for ideal magnetohydrodynamics on unstructured grids within the COOLFluiD platform [13]. Almost all of these authors employed a renumbering technique to cope with the problem of non-coalesced memory access, which has been discussed in detail in [14]. As demonstrated in their works, renumbering allows shared memory to be used, which effectively improves the performance of their codes.
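The effect of renumbering can be illustrated with a small sketch. The code below is not taken from any of the cited solvers: it assumes a plain breadth-first reordering of cells over the face-adjacency graph, used here as a simplified stand-in for the schemes discussed in [14], and the helper names (`bfs_renumber`, `apply_renumbering`, `index_spread`) and the adjacency-list layout are hypothetical.

```python
from collections import deque


def bfs_renumber(neighbors):
    """Return new_id[old_id] from a breadth-first sweep over the cell adjacency.

    neighbors[i] lists the cells sharing a face with cell i (assumed layout).
    """
    n = len(neighbors)
    new_id = [-1] * n
    next_id = 0
    for seed in range(n):  # loop over seeds so disconnected meshes are covered
        if new_id[seed] != -1:
            continue
        new_id[seed] = next_id
        next_id += 1
        queue = deque([seed])
        while queue:
            cell = queue.popleft()
            for nb in neighbors[cell]:
                if new_id[nb] == -1:
                    new_id[nb] = next_id
                    next_id += 1
                    queue.append(nb)
    return new_id


def apply_renumbering(neighbors, new_id):
    """Rebuild the adjacency list in the new numbering."""
    out = [None] * len(neighbors)
    for old, nbs in enumerate(neighbors):
        out[new_id[old]] = sorted(new_id[nb] for nb in nbs)
    return out


def index_spread(neighbors):
    """Largest index distance between any cell and one of its neighbors."""
    return max(abs(i - nb) for i, nbs in enumerate(neighbors) for nb in nbs)


# A chain of six cells stored in scrambled order (physical order 0-3-1-4-2-5).
scrambled = [[3], [3, 4], [4, 5], [0, 1], [1, 2], [2]]
new_id = bfs_renumber(scrambled)
ordered = apply_renumbering(scrambled, new_id)
print(index_spread(scrambled), index_spread(ordered))
```

In this toy case the largest index distance between neighboring cells drops from 3 to 1. On a GPU, a smaller spread means that threads with consecutive indices tend to read nearby addresses, which is the condition coalesced access relies on.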
In fact, the renumbering technique is suitable for static unstructured meshes and may fail for dynamic ones. Dynamic unstructured meshes arise in adaptive mesh refinement (AMR), one of the most important approaches in CFD. By refining coarse cells only where the truncation error is large, AMR solves the conservation equations with far fewer computing resources than a uniformly fine mesh. However, the adaptive mesh is complicated and dynamic, which makes it hard to parallelize, especially on GPU. Wang and his team [15], as well as the group led by Hsi-Yu Schive [16], have implemented solvers on structured meshes with the AMR method, but both avoided porting the mesh adaptation itself to the GPU. In the cell-based AMR method, if mesh adaptation is processed on the CPU, data must be exchanged frequently between CPU and GPU, which introduces a bottleneck for the code’s overall performance. Removing this bottleneck requires a parallel mesh adaptation algorithm on GPU, which motivates the current work: we attempt to implement such a solver with cell-based AMR on GPU.
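The refinement criterion described above can be sketched as follows. This is an illustration only, not the procedure used in this work: the error indicator values, the threshold `tol`, the level cap `max_level` and the `Cell` record are assumptions made for the example, and each flagged quadrilateral is simply split into four children through its edge midpoints and the average of its corners.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Four corner points (x, y) in counter-clockwise order (assumed layout).
    corners: tuple
    level: int = 0


def midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def split(cell):
    """Split one quadrilateral into four children of the next level."""
    p0, p1, p2, p3 = cell.corners
    m01, m12 = midpoint(p0, p1), midpoint(p1, p2)
    m23, m30 = midpoint(p2, p3), midpoint(p3, p0)
    c = (sum(p[0] for p in cell.corners) / 4.0,
         sum(p[1] for p in cell.corners) / 4.0)
    lv = cell.level + 1
    return [
        Cell((p0, m01, c, m30), lv),
        Cell((m01, p1, m12, c), lv),
        Cell((c, m12, p2, m23), lv),
        Cell((m30, c, m23, p3), lv),
    ]


def refine(cells, error, tol, max_level):
    """Replace every cell whose error indicator exceeds tol by its children."""
    out = []
    for cell, err in zip(cells, error):
        if err > tol and cell.level < max_level:
            out.extend(split(cell))
        else:
            out.append(cell)
    return out


def area(cell):
    """Shoelace formula for the area of a simple quadrilateral."""
    pts = cell.corners
    s = 0.0
    for i in range(4):
        x0, y0 = pts[i]
        x1, y1 = pts[(i + 1) % 4]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0


# Two unit squares side by side; only the first one has a large error value.
coarse = [Cell(((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0))),
          Cell(((1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)))]
refined = refine(coarse, error=[0.9, 0.1], tol=0.5, max_level=3)
print(len(refined), sum(area(c) for c in refined))
```

Only the flagged cell is subdivided, so the mesh grows from two cells to five while covering the same area. The cell list changes length and layout after every such call, which is why a numbering fixed in advance does not stay valid on an adaptive mesh.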
The rest of this paper is organized as follows. Section 2 gives a brief introduction to the numerical method and the cell-based AMR method used in this work. Section 3 describes the implementation of the method on GPU in detail. The numerical results and the solver performance on GPU are discussed in Section 4. Finally, conclusions are drawn.
Numerical method
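Consider the two-dimensional Euler equations for an inviscid, compressible flow, given in the standard conservative form as:
$$\frac{\partial \mathbf{U}}{\partial t} + \frac{\partial \mathbf{F}}{\partial x} + \frac{\partial \mathbf{G}}{\partial y} = 0,$$
where $\mathbf{U}$, $\mathbf{F}$ and $\mathbf{G}$ are defined as:
$$\mathbf{U} = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ E \end{pmatrix}, \quad \mathbf{F} = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ (E + p)u \end{pmatrix}, \quad \mathbf{G} = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ (E + p)v \end{pmatrix},$$
and
$$E = \rho e + \tfrac{1}{2}\rho\,(u^2 + v^2), \qquad p = (\gamma - 1)\,\rho e,$$
where $\rho$, $u$ and $v$, $p$, $E$, $e$ and $\gamma$ denote the density, $x$- and $y$-velocities, pressure, total energy, internal energy and specific heat ratio, respectively.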
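As a concrete illustration of the quantities involved, the sketch below evaluates the conserved variables and the x-direction flux of the two-dimensional Euler equations from primitive values. It assumes the usual conservative state (ρ, ρu, ρv, E), with E taken as total energy per unit volume, and an ideal gas with specific heat ratio `gamma`; the function names are hypothetical, and this is not the VAS2D discretization.

```python
def conserved(rho, u, v, p, gamma=1.4):
    """Primitive (rho, u, v, p) -> conserved (rho, rho*u, rho*v, E)."""
    e = p / ((gamma - 1.0) * rho)               # specific internal energy, ideal gas
    E = rho * e + 0.5 * rho * (u * u + v * v)   # total energy per unit volume
    return (rho, rho * u, rho * v, E)


def flux_x(U, gamma=1.4):
    """x-direction Euler flux F(U) for a conserved state U."""
    rho, mx, my, E = U
    u, v = mx / rho, my / rho
    # Recover the pressure from the total energy (ideal-gas assumption).
    p = (gamma - 1.0) * (E - 0.5 * rho * (u * u + v * v))
    return (mx, mx * u + p, my * u, (E + p) * u)


# Example state: unit density and pressure, moving in x with u = 2.
U = conserved(rho=1.0, u=2.0, v=0.0, p=1.0)
F = flux_x(U)
print(U, F)
```

For this state the total energy is 4.5 and the flux is approximately (2, 5, 0, 11), up to floating-point rounding. The y-direction flux follows the same pattern with the roles of u and v exchanged.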
The numerical method in this study is based on VAS2D developed by Sun and Takayama [17], [18]. An adaptive unstructured
GPU implementation
Because data exchange between GPU and CPU is expensive, it should, as a golden rule, be avoided as far as possible. Thus, to achieve higher computing performance, the solver in this study is designed to run entirely on the GPU. The main computational procedure is sketched in Fig. 2. First, during data initialization, the program allocates a large block of memory, reads the mesh data, initializes the flow field and carries out the initial adaptation. Then the initialized data is sent to
Results and analysis
In this section, simulation results for a shock diffraction problem are presented to verify the two GPU codes (the primary code running on GT9800 and C2050, and the optimized code on C2050) and to analyze their performance. All codes are written in CUDA C with CUDA 5, and the simulations are computed in single-precision floating-point format.
Conclusions
The cell-based AMR on unstructured quadrilateral mesh is realized on GPU in this study. Specifically, we implemented and optimized the well-validated numerical method VAS2D on GPU: null memory recycling is added to improve memory utilization, and list processing is parallelized on GPU with low-frequency atomic operations. In this way, we have taken one step further toward realizing AMR on GPU. Our work is, to the best of our knowledge, the first unstructured cell-based algorithm that
Acknowledgment
This research was carried out with the support of the National Natural Science Foundation of China under Grants 11172292 and 11272310.
References (24)
- et al., J. Comput. Phys. (2007)
- et al., Comput. Phys. Comm. (2011)
- et al., J. Comput. Phys. (2011)
- et al., Comput. & Fluids (2012)
- Comput. Phys. Comm. (2015)
- et al., Comput. Methods Appl. Mech. Engrg. (2010)
- et al., Comput. Phys. Comm. (2014)
- et al., New Astron. (2010)
- et al., J. Comput. Phys. (1999)
- et al., J. Comput. Phys. (2003)
- J. Comput. Phys.
- Vis. Comput.