Algebraic multigrid employing mixed structured–unstructured data on manycore hardware
Introduction
Manycore processors, such as the Intel Xeon Phi and the Graphics Processing Units (GPUs) of Nvidia and AMD, have lowered the barrier to supercomputing, offering several teraflops of computing power in a single attached co-processor. The strong popularity of Nvidia GPUs has led to the wide adoption of Nvidia's Compute Unified Device Architecture (CUDA) parallel computing architecture. While CUDA is conceptually simple, in practice it can be challenging to obtain good performance from a GPU. This is due to the need to expose significant parallelism to keep thousands of threads active, and to minimize the overhead of communication between CPU and GPU. Optimum performance requires hybrid programming, where GPUs and CPUs execute tasks concurrently.
Manycore programming is consequential for developments in Computational Fluid Dynamics (CFD), since most general purpose solvers now in use in industry and academia had their genesis in the mid 90s or early 2000s, well before the present trend toward manycore processing could be anticipated. While GPUs have been enthusiastically adopted by the CFD research community [37], more research is needed to optimize applications for heterogeneous CPU/GPU resources and to ensure performance portability as processor technology evolves.
Basic research in integrating CFD solutions with manycore architectures is recent, with most of it appearing in the last several years. It contributes to a growing body of work on improving multigrid solver performance in single-GPU and multi-GPU environments [28], [32], [14], [36], on the solution speedups obtained when older CFD applications are upgraded to include GPUs [12], [8], [18], [6], and on optimizing memory (and cache) usage to improve performance [5], [13], [34], which increasingly includes diverse architecture types [36], [22], [11]. Issues related to deploying structured and unstructured data over manycore resources are also being explored [22], [7]. Note that these studies generally involve designing OpenMP- or MPI-level parallelism combined with CUDA (and increasingly OpenACC) to access multiple GPUs [33], [15]. More recently, research is beginning to include new architectures such as those based on the Intel Xeon Phi co-processor [36].
The focus of this paper is the introduction of a data structure for hybrid structured/unstructured data in CFD problems that incorporate an algebraic multigrid method. While not the subject of this study, the hybrid structure is also designed to isolate single-precision from double-precision mesh regions, and to solve different regions of the mesh on either CPUs or GPUs. Both of these aspects are integrated into the multigrid solver.
The particular algebraic multigrid chosen in this work to demonstrate the flexibility of the data structure is the Additive Correction Multigrid (ACM) [16].
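For context, the defining feature of ACM is that each coarse-level equation is formed by summing the fine-level equations over a group (block) of cells, and the resulting scalar correction is added back uniformly to every cell in the group. The following is a minimal two-level sketch of that idea in Python/NumPy; the dense matrix, damped Jacobi smoother, direct coarse solve, and all names are assumptions made purely for illustration, not the implementation described in this paper.

```python
import numpy as np

def acm_two_level(A, b, x, groups, n_coarse, nu=5):
    """One two-level Additive Correction Multigrid (ACM) cycle (sketch).

    The coarse equation for group I is the sum of the fine-level residual
    equations over its member cells; the resulting scalar correction e_I
    is added back to every fine cell in group I.
    """
    D = np.diag(A)

    def smooth(x, sweeps, omega=2.0 / 3.0):
        # Damped Jacobi smoother (omega = 2/3 is a common damping choice).
        for _ in range(sweeps):
            x = x + omega * (b - A @ x) / D
        return x

    x = smooth(x, nu)                     # pre-smoothing
    r = b - A @ x                         # fine-level residual
    # Piecewise-constant restriction: R[I, i] = 1 iff cell i is in group I,
    # so R @ r sums the residuals over each group (the "additive correction").
    R = np.zeros((n_coarse, len(b)))
    R[groups, np.arange(len(b))] = 1.0
    Ac = R @ A @ R.T                      # summed (Galerkin) coarse operator
    ec = np.linalg.solve(Ac, R @ r)       # solve the coarse correction equations
    x = x + R.T @ ec                      # same correction for every group member
    return smooth(x, nu)                  # post-smoothing
```

On a small 1-D Poisson problem with pairwise agglomeration (two fine cells per coarse cell), repeated application of this cycle drives the residual toward zero, and the exact solution is a fixed point of the cycle.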
There are presently several examples of GPU-accelerated CFD codes specifically for either structured data (General Electric TACOMA, Turbostream, etc.) or unstructured data (ANSYS Fluent, Rolls-Royce HYDRA, etc.). The hybrid structured/unstructured code design is the subject of this work. The Nvidia AmgX API, which contains GPU implementations of algebraic multigrid methods (both classical and aggregation-based), multiple preconditioners and Krylov subspace solvers, is employed by the ANSYS Fluent CFD software, and 2×-5× speedups have been reported in [26] for different test problems with unstructured data. In [35], GPU acceleration of the General Electric in-house TACOMA code for structured data yields 2×-3× speedups. In [2], [4] Bell et al. study the acceleration of different components of algebraic multigrid methods, e.g. aggregation, interpolation (i.e. prolongation/restriction) and the Galerkin product, on GPU devices by introducing algorithms that expose fine-grained parallelism. In [9] the authors studied fine-grained parallelization of the agglomeration stage of the setup phase for an unsmoothed aggregation-based AMG and further improved the robustness of the solver by using K-cycles instead of the common V-cycles. They showed that this unsmoothed aggregation-based AMG outperforms classical AMG on larger-scale problems.
In this work the authors are interested in the performance of the multigrid solver only to the extent that it involves the underlying data structure, so the specific choices of smoothers, restriction, prolongation or aggregation techniques are not central to the conclusions. Consequently a simple implementation of the ACM algebraic multigrid with simple Gauss-Seidel (GS) or Jacobi smoothers is employed. The sparse matrix storage in this work is a variant of CSR (Compressed Sparse Row) for unstructured regions, while structured regions are stored in a true structured format, without any auxiliary tables, at all multigrid levels. This approach is superior to the DIA (diagonal) storage format and permits reaching higher memory bandwidths on structured grids. The common storage format for hybrid grids is HYB (hybrid ELL-COO), which shows good performance for simple test cases and may approach the performance of our CSR combined with the true structured implementation. Nothing in the present study would preclude the use of HYB storage for the unstructured cell regions. A review of the performance of different storage formats is presented in [19], [3].
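To illustrate the storage distinction drawn above, the sketch below contrasts a CSR matrix-vector product, which needs a column-index table entry per nonzero, with a 5-point structured stencil product, where neighbours are located by index arithmetic alone and memory access stays unit-stride. This is a simplified NumPy illustration under assumed function names and dense per-cell coefficient arrays, not the paper's implementation.

```python
import numpy as np

def spmv_csr(data, indices, indptr, x):
    """SpMV with CSR storage: one indirection (indices) per nonzero."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        lo, hi = indptr[row], indptr[row + 1]
        y[row] = data[lo:hi] @ x[indices[lo:hi]]
    return y

def spmv_structured_5pt(aP, aW, aE, aS, aN, x):
    """SpMV for a 5-point stencil on an ni x nj grid stored as coefficient
    arrays: neighbours are found by index arithmetic, so no column-index
    tables are needed and accesses remain contiguous (coalescing-friendly)."""
    y = aP * x
    y[1:, :]  += aW[1:, :]  * x[:-1, :]   # west neighbour (i-1, j)
    y[:-1, :] += aE[:-1, :] * x[1:, :]    # east neighbour (i+1, j)
    y[:, 1:]  += aS[:, 1:]  * x[:, :-1]   # south neighbour (i, j-1)
    y[:, :-1] += aN[:, :-1] * x[:, 1:]    # north neighbour (i, j+1)
    return y
```

The structured variant touches only the coefficient and solution arrays themselves; the CSR variant additionally streams the `indices` and `indptr` tables, which is the overhead the true structured format avoids.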
In the trend towards high-performance computing on manycore architectures and, more broadly, heterogeneous computational resources, there have been challenges in obtaining the most benefit from the resources. These challenges mostly concern the design of computational tasks and their compatibility with the new architectures. For instance, simply porting legacy codes to a GPU programming language yields less than the desired performance benefit. In [27] the authors indicate best practices and recommendations for obtaining better performance from GPU devices. It is shown that performance benefits from:

- coalesced memory access patterns,
- use of single-precision instead of double-precision variables where single-precision computation fulfills the precision needs, and
- masking data transfer between host and device.
This paper focuses on the integration of hybrid data (i.e. structured/unstructured data) with the cell/interface software design. Structured data are very suitable for GPU computation since they maximize coalesced global device memory access by many CUDA threads. On the other hand, generation of structured grids is challenging, if not impossible, for many problems with complicated geometries. Unstructured mesh generation is common practice for complicated geometries; however, these can also be handled with hybrid structured/unstructured meshing if the solver is capable of processing the different data regions. The present work emphasizes the improved performance (when deployed on GPUs) of a CFD solution designed to isolate structured and unstructured data regions, fully integrated within a multigrid solution.
To maximize GPU acceleration it is important to increase the use of structured data and resort to unstructured data only where necessary. This approach requires a flexible data structure that permits seamless transitions between structured and unstructured data blocks at the interfaces and within the algebraic multigrid solver. This work therefore presents such a data structure in detail. The benefits of this true hybrid structured/unstructured approach, which allows maximal use of structured data, are shown with two submarine incident-flow problem examples.
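As a toy illustration of the kind of seamless structured/unstructured transition described above, the following sketch performs one Jacobi sweep over a 1-D mesh whose left half is stored in structured form (3-point stencil, no index tables) and whose right half is stored in CSR, with the two regions coupled through halo values at the interface. All names, and the negative-index sentinel convention for interface entries, are assumptions made for this sketch, not EXN/Aero's design.

```python
import numpy as np

def hybrid_jacobi_sweep(x_s, b_s, x_u, b_u, csr, iface):
    """One Jacobi sweep over a hybrid 1-D mesh (illustrative sketch).

    x_s, b_s : structured region with the stencil (-1, 2, -1); neighbours
               are found by index arithmetic, no tables needed
    x_u, b_u : unstructured region stored in CSR (data, indices, indptr)
    iface    : (last structured cell, first unstructured cell) pair whose
               off-block coefficient (-1) is applied via halo values
    """
    data, indices, indptr = csr
    sL, uF = iface
    # Halo exchange: each region sees the other's old interface value.
    halo_from_u, halo_from_s = x_u[uF], x_s[sL]

    # Structured sweep: neighbours located by index arithmetic only.
    xs = x_s.copy()
    for i in range(len(x_s)):
        left = x_s[i - 1] if i > 0 else 0.0                  # Dirichlet end
        right = x_s[i + 1] if i < len(x_s) - 1 else halo_from_u
        xs[i] = (b_s[i] + left + right) / 2.0

    # Unstructured sweep: neighbours found through the CSR index tables;
    # a negative column index marks an interface (halo) connection.
    xu = x_u.copy()
    for row in range(len(x_u)):
        diag, off = 0.0, 0.0
        for k in range(indptr[row], indptr[row + 1]):
            j, a = indices[k], data[k]
            if j == row:
                diag = a
            elif j < 0:                     # interface entry -> halo value
                off += a * halo_from_s
            else:
                off += a * x_u[j]
        xu[row] = (b_u[row] - off) / diag
    return xs, xu
```

Because the halo values are taken from the previous iterate, repeated sweeps reproduce global Jacobi on the assembled system, so the split storage does not change the converged answer.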
Section snippets
Manycore CFD software design
In the present study the EXN/Aero CFD solver is employed, which has been designed from the outset to accommodate changes towards manycore computing. The software presently utilizes MPI, OpenMP and CUDA implementations for parallelization and solves the Reynolds-Averaged Navier-Stokes (RANS) or Large Eddy Simulation (LES) turbulent fluid flow equations across multiple GPUs and CPUs.
EXN/Aero is equipped with the parareal algorithm [20] to achieve temporal parallelization where coarse grain and fine
Additive correction multigrid
In this section the Additive Correction Multigrid method is discussed in detail. A data structure for unstructured meshes that is compatible with multigrid methods and suitable for mixed-data solution is presented. Also, since the solution of hybrid problems with mixed structured and unstructured mesh blocks on GPU architectures is being sought, interface operations between neighboring blocks become non-trivial. For this reason Section 3.2 discusses the interface design that is incorporated in
Multigrid performance tests
This section is divided into two parts. In the first part the performance of the proposed unstructured mesh data structure, i.e. mds_t, is assessed, and in the second the performance of the amg_t data structure for the unstructured mesh is tested and compared against the structured ACM implementation. These performance tests are repeated on the GPU as well to illustrate the behavior of the unstructured mesh against its structured counterpart on manycore architectures. A
Governing equations and results of a case study
In the next subsection the governing equations of fluid dynamics problems are briefly reviewed, followed by a subsection explaining the tests that have been performed to demonstrate the benefits of using mixed data when solving problems of practical interest in GPU-accelerated environments.
Conclusion
The geometry of practical fluid dynamics problems is often very complicated, and while employing unstructured computational grids may be convenient, it may not be the most efficient method. Hence it is important to maximize the use of structured subdomains in the grid to benefit from the fast solution of their memory-aligned data. On the other hand, the performance benefit of structured grids over unstructured grids is more pronounced when using GPU co-processors since
Acknowledgement
This work has been supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant held by the second author.
References (38)
- et al., A performance study of general-purpose applications on graphics processors using CUDA, J. Parallel Distrib. Comput. (2008)
- et al., Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. (2008)
- et al., A GPU accelerated aggregation algebraic multigrid method, Comput. Math. Appl. (2014)
- et al., Energy efficiency vs performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster, J. Comput. Phys. (2013)
- et al., Parallel preconditioned conjugate gradient algorithm on GPU, J. Comput. Appl. Math. (2012)
- et al., CFD-based analysis and two-level aerodynamic optimization on graphics processing units, Comput. Methods Appl. Mech. Eng. (2010)
- et al., Résolution d'EDP par un schéma en temps pararéel, Comptes Rendus de l'Académie des Sciences, Series I, Mathematics (2001)
- et al., Parallelization of an unstructured Navier-Stokes solver using a multi-color ordering method for OpenMP, Comput. Fluids (2013)
- et al., Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Comput. (2010)
- et al., A User's Guide to CGNS (2001)
- Exposing fine-grained parallelism in algebraic multigrid methods, SIAM J. Sci. Comput.
- Efficient sparse matrix-vector multiplication on CUDA, technical report
- Semi-automatic porting of a large-scale Fortran CFD code to GPUs, Int. J. Numer. Methods Fluids
- Running unstructured grid-based CFD solvers on modern graphics hardware, Int. J. Numer. Methods Fluids
- Benchmarking of a massively parallel hybrid CFD solver for ocean applications
- Using GPUs to improve multigrid solver performance on a cluster, Int. J. Comput. Sci. Eng.
- Optimization of a computational fluid dynamics code for the memory hierarchy: a case study, Int. J. High Perform. Comput. Appl.
- Accelerating hydrocodes with OpenACC, OpenCL and CUDA
Araz Eghbal is currently a PhD candidate in Mechanical Engineering at University of New Brunswick (Fredericton-Canada). He obtained his BSc in Physics from Amirkabir University of Technology (Tehran-Iran) in 2008 and MSc in Gravitation and Cosmology from Shahid Beheshti University (Tehran-Iran) in 2011.
Dr. Andrew G. Gerber is professor and chair at the department of Mechanical Engineering at the University of New Brunswick. His research interests are Computational Fluid Dynamics, Thermodynamics and Multiphase Flows.
Dr. Eric Aubanel is professor at the Faculty of Computer Science at University of New Brunswick. His area of research is High Performance Computing. He is a part of IBM Center for Advanced Studies-Atlantic and is associated with the Atlantic Computational Excellence Network (ACEnet).