Algebraic multigrid employing mixed structured–unstructured data on manycore hardware

https://doi.org/10.1016/j.jocs.2016.07.004

Highlights

  • A data structure for multigrid solution of mixed-data problems is explained.

  • The advantage of employing structured data for solving problems on GPGPUs is shown.

  • Unstructured data use is minimized to obtain maximum performance from GPGPUs.

Abstract

This paper presents investigations of computing performance when deploying a Computational Fluid Dynamics (CFD) solution on CPU and GPU resources. Critical to the performance is an algebraic multigrid solver that concurrently utilizes mixed structured and unstructured data. The software organization for manycore computing and the data storage patterns for efficient memory access are described in detail, along with performance testing on practical flow cases. It is shown that structured data blocks with more than 1 million cells are solved more than 25× faster on a GPU than with a single CPU thread, whereas unstructured data blocks do not exceed a 10× speedup in the same comparison. Consequently, maximizing the use of structured data blocks in a mixed-data configuration allows more efficient utilization of GPU acceleration while still benefiting from the flexibility of unstructured blocks for grid generation. The speedup obtained for mixed-data problems varies with the block configuration; an average 3× speedup is reported for a submarine incident flow problem (76% structured, 24% unstructured) in comparison with the same problem solved on a fully unstructured grid.

Introduction

Manycore processors, such as the Intel Xeon Phi and the Graphics Processing Units (GPUs) of Nvidia and AMD, have lowered the barrier to supercomputing, offering several teraflops of computing power in a single attached co-processor. The strong popularity of Nvidia GPUs has led to the wide adoption of Nvidia's Compute Unified Device Architecture (CUDA) parallel computing architecture. While CUDA is conceptually simple, in practice it can be challenging to obtain good performance from a GPU. This is due to the need to expose significant parallelism to keep thousands of threads active, and to minimize the overhead of communication between CPU and GPU. Optimum performance requires hybrid programming, where GPUs and CPUs execute tasks concurrently.

Manycore programming is consequential for developments in Computational Fluid Dynamics (CFD), since most general-purpose solvers now in use in industry and academia had their genesis in the mid-1990s or early 2000s, well before the present trend toward manycore processing could be anticipated. While GPUs have been enthusiastically adopted by the CFD research community [37], more research is needed to optimize applications for heterogeneous CPU/GPU resources and to ensure performance portability as processor technology evolves.

Basic research in integrating CFD solutions with manycore architectures is recent, with most of it appearing in the last several years. It contributes to a growing body of work on improving multigrid solver performance in single-GPU and multi-GPU environments [28], [32], [14], [36], on exploring solution speedups in older CFD applications that have been upgraded to include GPUs [12], [8], [18], [6], and on optimizing memory (and cache) usage to improve performance [5], [13], [34], increasingly across diverse architecture types [36], [22], [11]. Issues related to deploying structured and unstructured data over manycore resources are also being explored [22], [7]. Note that these studies generally involve designing OpenMP- or MPI-level parallelism with CUDA (and increasingly OpenACC) to access multiple GPUs [33], [15]. More recently, research has begun to include new architectures such as the Intel Xeon Phi co-processor [36].

The focus of this paper is the introduction of a data structure for hybrid structured/unstructured data in CFD problems that incorporate an algebraic multigrid method. While not the subject of this study, the hybrid structure is also designed to isolate single-precision from double-precision mesh regions, and to solve different regions of the mesh on either CPUs or GPUs. Both of these aspects are integrated into the multigrid solver.

The particular algebraic multigrid chosen in this work to demonstrate the flexibility of the data structure is the Additive Correction Multigrid (ACM) [16].
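For reference, the core of ACM can be summarized as follows. This is the standard additive-correction formulation, not a reproduction of this paper's implementation:

```latex
% Fine-level discrete equation for cell i (neighbour form):
%   a_i \phi_i = \sum_{j \in nb(i)} a_{ij}\phi_j + b_i .
% ACM agglomerates cells into coarse blocks and seeks one additive
% correction \delta_I per block, \phi_i \leftarrow \phi_i + \delta_I
% for all i \in I, chosen so the summed residual of block I vanishes.
% Summing the fine equations over each block yields a coarse system of
% the same algebraic form, with coefficients obtained purely by summation:
\begin{align}
  A^c_{II} &= \sum_{i \in I} a_i \;-\; \sum_{i \in I}\,
              \sum_{\substack{j \in nb(i)\\ j \in I}} a_{ij}, &
  A^c_{IJ} &= \sum_{i \in I}\,\sum_{\substack{j \in nb(i)\\ j \in J}} a_{ij},\\
  A^c_{II}\,\delta_I &= \sum_{J \neq I} A^c_{IJ}\,\delta_J
                        + \sum_{i \in I} r_i, &
  r_i &= b_i + \sum_{j \in nb(i)} a_{ij}\phi_j - a_i\phi_i .
\end{align}
```

Because the coarse equations are built by summation alone, the coarse levels inherit the storage layout of the fine level, which is what allows the structured and unstructured block formats discussed below to persist through all multigrid levels.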

There are presently several examples of GPU-accelerated CFD codes targeting either structured data (General Electric TACOMA, Turbostream, etc.) or unstructured data (ANSYS Fluent, Rolls-Royce HYDRA, etc.); the hybrid structured/unstructured code design is the subject of this work. The Nvidia AmgX API, which contains GPU implementations of the algebraic multigrid method (both classical and aggregation-based), multiple preconditioners and Krylov subspace solvers, is employed by the ANSYS Fluent CFD software, and 2×-5× speedups have been reported in [26] for different test problems with unstructured data. In [35], acceleration of the General Electric in-house TACOMA code with GPUs for structured data yields reported speedups of 2×-3×. In [2], [4] Bell et al. study the acceleration of different algebraic multigrid components, e.g. aggregation, interpolation (i.e. prolongation/restriction) and the Galerkin product, on GPU devices by introducing algorithms that expose fine-grained parallelism. In [9] the authors studied fine-grained parallelization of the agglomeration stage of the setup phase for an unsmoothed aggregation-based AMG and further improved the robustness of the solver by using K-cycles instead of the common V-cycles; they showed that this unsmoothed aggregation-based AMG outperforms classical AMG on larger-scale problems.

In this work the authors are interested in the performance of the multigrid solver only insofar as it involves the underlying data structure, so the specific choice of smoothers, restriction, prolongation or aggregation techniques is not central to the conclusions. Consequently a simple implementation of the ACM algebraic multigrid with simple Gauss-Seidel (GS) or Jacobi smoothers is employed. The sparse matrix storage in this work is a variant of CSR (Compressed Sparse Row) for unstructured regions, while structured regions are stored in a true structured format, without any auxiliary tables, at all multigrid levels. This approach is superior to the DIA (diagonal) storage format and permits reaching higher memory bandwidths on structured grids. The common storage format for hybrid grids is HYB (hybrid ELL-COO), which shows good performance for simple test cases and may approach the performance of our CSR combined with the true structured implementation; nothing in the present study would preclude the use of HYB storage for the unstructured cell regions. A review of the performance of different storage formats is presented in [19], [3].
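The paper's own data structures (mds_t and amg_t, discussed later) are not reproduced here; the following is a minimal sketch of the two storage ideas being contrasted, with all type and field names hypothetical. CSR carries explicit connectivity tables, while a true structured block derives its connectivity from (i, j, k) index arithmetic and therefore stores only stencil coefficients:

```cuda
// Minimal sketch (all names hypothetical) of the two storage ideas the
// text contrasts: CSR for unstructured regions vs. implicit indexing for
// structured regions, which needs no auxiliary connectivity tables.

// CSR: explicit connectivity, one indirection per off-diagonal entry.
struct CsrBlock {
    int     n_cells;
    int    *row_ptr;   // size n_cells + 1
    int    *col_idx;   // size nnz: neighbour cell indices
    double *val;       // size nnz: off-diagonal coefficients
    double *diag;      // size n_cells: diagonal stored separately
};

// Structured: connectivity is implied by (i,j,k) arithmetic, so only the
// stencil coefficients are stored, in contiguous per-direction arrays.
struct StructuredBlock {
    int ni, nj, nk;                            // block dimensions
    double *ap, *ae, *aw, *an, *as, *at, *ab;  // 7-point stencil coeffs
    __host__ __device__ int idx(int i, int j, int k) const {
        return i + ni * (j + nj * k);  // +x neighbour is idx(i+1, j, k)
    }
};
```

Because neighbour offsets in the structured block are fixed strides, no index arrays need to be read from memory, which is the bandwidth advantage the text claims over DIA and CSR for structured regions.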

In the trend towards high-performance computing on manycore architectures and, more broadly, heterogeneous computational resources, there have been challenges in extracting the most benefit from the resources. These challenges mostly concern the design of the computational tasks and their compatibility with the new architectures; for instance, simply porting legacy codes to a GPU programming language results in less-than-desired performance benefits. In [27] the authors give best practices and recommendations for obtaining better performance from GPU devices. It is shown that

  • coalesced memory access patterns,

  • use of single-precision instead of double-precision variables where single precision fulfills the precision needs, and

  • masking of data transfer between host and device

each introduce a significant multiplier to the speedup obtained from employing GPU accelerators. The cell/interface design [10] used in this work allows containment of the different data types (single/double precision) and kinds (structured/unstructured) within different regions of a specific problem domain, based on the judgment of expert users and the physics of the problem. Hybrid CPU/GPU load balancing also allows part of the data transfer and communication costs to be hidden, as sketched below.
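As a concrete illustration of the third recommendation, a generic CUDA double-buffering pattern (all names illustrative, not taken from EXN/Aero) hides the transfer of one block behind the computation of another:

```cuda
#include <cuda_runtime.h>

// A trivial per-block "smoother" standing in for the real solver kernel.
__global__ void smooth(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5;  // placeholder update
}

// Double-buffered pattern: copy block b+1 (stream `xfer`) while block b
// computes (stream `comp`). Host buffers must be pinned (cudaMallocHost)
// for the copy and the kernel to actually overlap.
void process_blocks(double **h_blk, double **d_blk, int *n, int n_blocks) {
    cudaStream_t xfer, comp;
    cudaStreamCreate(&xfer);
    cudaStreamCreate(&comp);

    cudaMemcpyAsync(d_blk[0], h_blk[0], n[0] * sizeof(double),
                    cudaMemcpyHostToDevice, xfer);
    for (int b = 0; b < n_blocks; ++b) {
        cudaStreamSynchronize(xfer);            // block b is now resident
        if (b + 1 < n_blocks)                   // prefetch the next block
            cudaMemcpyAsync(d_blk[b + 1], h_blk[b + 1],
                            n[b + 1] * sizeof(double),
                            cudaMemcpyHostToDevice, xfer);
        smooth<<<(n[b] + 255) / 256, 256, 0, comp>>>(d_blk[b], n[b]);
    }
    cudaStreamSynchronize(comp);
    cudaStreamDestroy(xfer);
    cudaStreamDestroy(comp);
}
```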

This paper focuses on the integration of the hybrid data kinds (i.e. structured/unstructured data) with the cell/interface software design. Structured data are very suitable for GPU computation since they maximize coalesced access to global device memory by many CUDA threads. On the other hand, generation of structured data/grids is challenging, if not impossible, for many problems with complicated geometries. Unstructured mesh generation is common practice for such geometries; however, they can also be handled with hybrid structured/unstructured meshing if the solver is capable of processing the different data regions. The present work emphasizes the improved performance (when deployed on GPUs) of a CFD solution designed to isolate structured and unstructured data regions, fully integrated within a multigrid solution.
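To make the coalescing argument concrete, the following illustrative smoother kernels (not the paper's code) contrast the two access patterns. In the structured sweep a warp reads consecutive addresses from each coefficient array; the CSR sweep performs a data-dependent gather through col_idx:

```cuda
// 1D 3-point stencil for brevity; a 3D 7-point stencil adds fixed
// strides of +/-ni and +/-ni*nj for the j and k neighbours.
__global__ void jacobi_structured(const double *ap, const double *ae,
                                  const double *aw, const double *b,
                                  const double *x, double *x_new, int n) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g > 0 && g < n - 1)  // interior cells; boundaries handled separately
        x_new[g] = (b[g] + ae[g] * x[g + 1] + aw[g] * x[g - 1]) / ap[g];
}

// CSR sweep: neighbour addresses depend on the mesh, so x[col_idx[k]]
// generally cannot coalesce across a warp.
__global__ void jacobi_csr(const int *row_ptr, const int *col_idx,
                           const double *val, const double *diag,
                           const double *b, const double *x,
                           double *x_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double s = b[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            s += val[k] * x[col_idx[k]];  // indirect, data-dependent gather
        x_new[i] = s / diag[i];
    }
}
```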

To maximize GPU acceleration it is important to increase the use of structured data and resort to unstructured data only where necessary. Following this approach requires the design of a flexible data structure that permits seamless transitions between structured and unstructured data blocks at the interfaces and within the algebraic multigrid solver. Such a data structure is presented in detail in this work, and the benefits of this true hybrid structured/unstructured approach, which allows maximal use of structured data, are shown with two submarine incident flow examples.
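The paper's interface design is described in Section 3.2. Purely as a hypothetical sketch, one way such a structured/unstructured coupling can reduce to precomputed gather/scatter lists (all names invented for illustration) is:

```cuda
// Hypothetical block interface: a structured face exchanging solution
// values with unstructured neighbour cells through index maps built once
// at setup time. Nothing here is taken from the paper's implementation.
struct Interface {
    int     n_faces;    // faces shared by the two blocks
    int    *donor_idx;  // flat cell index on the donor block side
    int    *recv_idx;   // ghost/halo slot on the receiving block side
};

__global__ void exchange_halo(const Interface ifc,
                              const double *donor_x, double *recv_ghost) {
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f < ifc.n_faces)
        recv_ghost[ifc.recv_idx[f]] = donor_x[ifc.donor_idx[f]];
}
```

With this kind of indirection confined to the interface lists, the interior sweeps of each block remain purely structured or purely CSR, which is consistent with the design goal stated above.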

Section snippets

Manycore CFD software design

In the present study the EXN/Aero CFD solver is employed, which has been designed from the outset to accommodate the shift towards manycore computing. The software utilizes MPI, OpenMP and CUDA implementations for parallelization and solves the Reynolds-Averaged Navier-Stokes (RANS) or Large Eddy Simulation (LES) turbulent fluid flow equations across multiple GPUs and CPUs.

EXN/Aero is equipped with the parareal algorithm [20] to achieve temporal parallelization where coarse grain and fine

Additive correction multigrid

In this section the Additive Correction Multigrid method is discussed in detail. A data structure for unstructured meshes that is compatible with multigrid methods and suitable for mixed-data solution is presented. Since the solution of hybrid problems with mixed structured and unstructured mesh blocks on GPU architectures is being sought, interface operations between neighboring blocks become non-trivial. For this reason Section 3.2 discusses the interface design that is incorporated in

Multigrid performance tests

This section is divided into two parts. In the first part the performance of the proposed unstructured mesh data structure, i.e. mds_t, is assessed, and in the second part the performance of the amg_t data structure for the unstructured mesh is tested and compared against the performance of the structured ACM implementation. These performance tests are repeated on the GPU as well, to illustrate the behavior of the unstructured mesh against its structured counterpart on manycore architectures. A

Governing equations and results of a case study

In the next subsection the governing equations of fluid dynamics problems are briefly reviewed, followed by a subsection explaining the tests that have been performed to demonstrate the benefits of using mixed data when solving problems of practical interest in GPU-accelerated environments.

Conclusion

The geometry of practical fluid dynamics problems is usually very complicated, and while employing unstructured computational grids may be convenient, it may not be the most efficient approach. Hence it is important to maximize the use of structured subdomains in the grid as much as possible, to benefit from the fast solution of their memory-aligned data. On the other hand, the performance benefits of structured grids over unstructured grids are more pronounced when using GPU co-processors since

Acknowledgement

This work has been supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant held by the second author.


References (38)

  • N. Bell et al.

    Exposing fine-grained parallelism in algebraic multigrid methods

    SIAM J. Sci. Comput.

    (2012)
  • N. Bell et al.

    Efficient sparse matrix-vector multiplication on CUDA. Technical report

    (2008)
  • N. Bell et al.
    (2012)
  • A. Corrigan et al.

    Semi-automatic porting of a large-scale Fortran CFD code to GPUs

    Int. J. Numer. Methods Fluids

    (2012)
  • A. Corrigan et al.

    Running unstructured grid-based CFD solvers on modern graphics hardware

    Int. J. Numer. Methods Fluids

    (2011)
  • A.G. Gerber et al.

    Benchmarking of a massively parallel hybrid CFD solver for ocean applications

  • D. Goddeke et al.

    Using GPUs to improve multigrid solver performance on a cluster

    Int. J. Comput. Sci. Eng.

    (2008)
  • T. Hauser et al.

    Optimization of a computational fluid dynamics code for the memory hierarchy: a case study

    Int. J. High Perform. Comput. Appl.

    (2010)
  • J.A. Herdman et al.

    Accelerating hydrocodes with OpenACC, OpenCL and CUDA


Araz Eghbal is currently a PhD candidate in Mechanical Engineering at the University of New Brunswick (Fredericton, Canada). He obtained his BSc in Physics from Amirkabir University of Technology (Tehran, Iran) in 2008 and his MSc in Gravitation and Cosmology from Shahid Beheshti University (Tehran, Iran) in 2011.

Dr. Andrew G. Gerber is professor and chair of the Department of Mechanical Engineering at the University of New Brunswick. His research interests are Computational Fluid Dynamics, Thermodynamics and Multiphase Flows.

Dr. Eric Aubanel is professor at the Faculty of Computer Science at the University of New Brunswick. His area of research is High Performance Computing. He is part of the IBM Center for Advanced Studies-Atlantic and is associated with the Atlantic Computational Excellence Network (ACEnet).
