GPU-based flow simulation with detailed chemical kinetics

doi:10.1016/j.cpc.2012.10.013

Computer Physics Communications

Volume 184, Issue 3, March 2013, Pages 596-606

https://doi.org/10.1016/j.cpc.2012.10.013

Abstract

The current paper reports on the implementation of a numerical solver on Graphics Processing Units (GPUs) to model reactive gas mixtures with detailed chemical kinetics. The solver incorporates high-order finite volume methods for solving the fluid dynamical equations coupled with stiff source terms. The chemical kinetics are solved implicitly via an operator-splitting method. We explored different approaches to implementing a fast kinetics solver on the GPU, and the details of the implementation are discussed in the paper. The solver is tested with two high-order shock-capturing schemes: MP5 (Suresh and Huynh, 1997) [9] and ADERWENO (Titarev and Toro, 2005) [10]. Considering only the fluid dynamics calculation, the speed-up factors obtained are 30 for the MP5 scheme and 55 for the ADERWENO scheme. For the fully coupled solver, the performance gain depends on the size of the reaction mechanism. Two chemistry examples were explored. The first mechanism consists of 9 species and 38 reactions, resulting in a speed-up factor of up to 35. The second, larger mechanism consists of 36 species and 308 reactions, resulting in a speed-up factor of up to 40.

Introduction

A detailed study of complex physical phenomena associated with high-speed fluid flow requires a deep understanding of the fundamental physics, which involves many highly non-linear processes evolving on different spatial and temporal scales. The main challenge in solving these problems stems from the coupling between these processes and its impact on the flow solution. For example, in combustion studies, one has to pay close attention to the coupling between fluid transport and chemical kinetics in order to characterize the combustion process. Computational Fluid Dynamics (CFD) techniques can provide detailed flow solutions for practical applications. Although the physics can be formulated mathematically in great detail, numerical simulation of high-speed fluid flow in a non-equilibrium environment is often limited by the computational power required to solve the governing equations. Modern CFD codes are designed to take advantage of high-performance computing (HPC) platforms to reduce run time. Unfortunately, access to traditional HPC resources is limited by their cost and maintenance requirements. These limitations have accentuated the need for a compact, low-cost HPC solution on which numerical solvers can be effectively implemented.

During the last eight years, the Graphics Processing Unit (GPU) has emerged as a promising alternative to high-cost HPC platforms. Within this period, the GPU has evolved into a highly capable, low-cost computing solution for scientific research. Fig. 1 illustrates the advantage of the GPU over the traditional Central Processing Unit (CPU) in floating-point throughput, which stems from the GPU's design for the highly parallel task of graphics rendering. Starting in 2008, GPUs began to support double-precision arithmetic, which is necessary for scientific computing. The newest generation of NVIDIA GPUs, called “Fermi”, was designed to improve double-precision performance over previous generations of NVIDIA GPUs. CUDA [1], currently the most popular programming environment for general-purpose GPU computing, has undergone several development phases and reached a level of maturity that is essential for the design of numerical solvers. Other alternatives such as OpenCL, DirectCompute, and Stream have also begun to mature, which encourages new developments in scientific computing on the GPU. Several attempts have been made to write scientific codes for the GPU, either directly in CUDA or via wrappers that call CUDA kernel functions, with promising results in terms of both performance and flexibility. Previous implementations of CFD codes on the GPU focused purely on the fluid dynamics, using either finite volume [2], [3] or finite element methods [4]. Recently, there have been a number of GPU-based implementations of multiphysics simulations. Of particular note are the extension to magnetohydrodynamics by Wong et al. [5] and the automated preprocessor tool for modeling finite-rate chemical kinetics by Linford et al. [6], [7]. In all cases, the reported speed-ups are promising and clearly demonstrate that the GPU is suitable for massively parallel scientific computing.

In this paper, we describe in detail the implementation of a numerical solver coupling the fluid dynamics with detailed chemical kinetics. Unlike previous attempts at porting numerical solvers to the GPU, we place our emphasis on the kinetics solver rather than the fluid dynamics, because in the simulation of high-speed fluid flow with detailed kinetics, the computation is dominated by solving the kinetics. While the current implementation is limited to chemical kinetics, it can readily be extended to more general kinetics (e.g., collisional-radiative kinetics for plasmas), since all the elementary processes in a plasma (excitation/de-excitation, ionization/recombination, etc.) can be represented as chemical reactions whose rates are computed a priori and tabulated as functions of temperature.
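The idea of treating each elementary process as a reaction whose rate is tabulated against temperature can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes a uniform temperature grid, linear interpolation, and clamping at the table ends. The function names `build_rate_table` and `lookup_rate` and the Arrhenius-like `example_rate` used to fill the table are hypothetical.

```python
import bisect
import math


def build_rate_table(rate_fn, t_min, t_max, n):
    """Tabulate rate_fn(T) at n uniformly spaced temperatures in [t_min, t_max]."""
    temps = [t_min + i * (t_max - t_min) / (n - 1) for i in range(n)]
    return temps, [rate_fn(t) for t in temps]


def lookup_rate(temps, rates, t):
    """Linearly interpolate a tabulated rate; clamp outside the table range."""
    if t <= temps[0]:
        return rates[0]
    if t >= temps[-1]:
        return rates[-1]
    i = bisect.bisect_right(temps, t) - 1  # temps[i] <= t < temps[i + 1]
    w = (t - temps[i]) / (temps[i + 1] - temps[i])
    return (1.0 - w) * rates[i] + w * rates[i + 1]


# Hypothetical rate used only to fill the table (not from the paper).
def example_rate(t):
    return 1.0e10 * math.exp(-15000.0 / t)


temps, rates = build_rate_table(example_rate, 1000.0, 5000.0, 401)
k = lookup_rate(temps, rates, 2505.0)  # halfway between two table nodes
```

Precomputing the table moves the expensive rate evaluation out of the per-cell loop, so at run time each cell only performs a search and one interpolation.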

The rest of the paper is organized as follows. The governing equations and the numerical formulation for both the fluid dynamics and chemical kinetics are described in Sections 2 and 3. Section 4 gives some background on GPU computing. The code implementation is detailed in Section 5, highlighting several optimization techniques for maximizing the performance of the solver. Results of several benchmark test cases for both non-reactive and reactive flow fields, as well as the performance results of the solver, are presented in Section 6. Section 7 gives the conclusions and points out possible future work.


Governing equations

The flow is modeled as a mixture of gas species, neglecting viscous effects, while the chemical reactions taking place between the gas components are modeled in detail. The Euler equations for a reactive gas mixture can be written as:

$\frac{\partial Q}{\partial t} + \nabla \cdot \bar{F} = \dot{\Omega}$

where $Q$ and $\bar{F}$ are the vectors of conservative variables and inviscid fluxes, respectively, and $\dot{\Omega}$ is the vector of chemical source terms. We assume that there is no species diffusion and that the gas is in thermal equilibrium (i.e., all species have the same velocity and all the
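The exact form of the chemical source term is not given in this excerpt. As a hedged illustration, the sketch below evaluates species mass production rates from the standard law of mass action, which is one common way such a term is built. The argument layout (`nu_reac`, `nu_prod` as per-reaction lists of stoichiometric coefficients) and the toy reaction A2 ⇌ 2A are assumptions for illustration.

```python
def species_production_rates(conc, kf, kb, nu_reac, nu_prod, mol_weights):
    """Mass production rate of each species (per unit volume), law of mass action.

    conc[s]         molar concentration of species s
    kf[r], kb[r]    forward / backward rate constants of reaction r
    nu_reac[r][s]   stoichiometric coefficient of s as a reactant in r
    nu_prod[r][s]   stoichiometric coefficient of s as a product in r
    mol_weights[s]  molecular weight of species s
    """
    n_species = len(conc)
    omega = [0.0] * n_species
    for r in range(len(kf)):
        fwd = kf[r]
        bwd = kb[r]
        for s in range(n_species):
            fwd *= conc[s] ** nu_reac[r][s]
            bwd *= conc[s] ** nu_prod[r][s]
        q = fwd - bwd  # net rate of progress of reaction r
        for s in range(n_species):
            omega[s] += mol_weights[s] * (nu_prod[r][s] - nu_reac[r][s]) * q
    return omega


# Hypothetical single reaction A2 <=> 2A with W(A2) = 2, W(A) = 1.
omega = species_production_rates(
    conc=[1.0, 0.5], kf=[2.0], kb=[4.0],
    nu_reac=[[1, 0]], nu_prod=[[0, 2]], mol_weights=[2.0, 1.0])
```

Because each reaction conserves mass, the entries of `omega` sum to zero; here the result is `[-2.0, 2.0]`.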

Fluid dynamics

A dimensional splitting technique [8] is used for solving the convective part of the governing equations. To achieve high-order accuracy in both space and time, we employ a fifth-order Monotonicity-Preserving scheme (MP5) [9] for the reconstruction and a third-order Runge–Kutta scheme (RK3) for time integration. For the MP5 scheme, the reconstructed left and right states at interface $j + \frac{1}{2}$ are given by (see Fig. 2):

$u_{j+\frac{1}{2}}^{L} = \frac{1}{60}\left(2u_{j-2} - 13u_{j-1} + 47u_{j} + 27u_{j+1} - 3u_{j+2}\right)$

$u_{j+\frac{1}{2}}^{R} = \frac{1}{60}\left(2u_{j+3} - 13u_{j+2} + 47u_{j+1} + 27u_{j} - 3u_{j-1}\right)$
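A scalar sketch of the MP5 reconstruction follows. The unlimited value is the left-state interpolant above; the limiter steps and the constants `alpha = 4` and `eps = 1e-10` follow the commonly published form of Suresh and Huynh's scheme [9] and are not spelled out in this excerpt, so treat them as assumptions. The right state at the same interface reuses the same function on the mirrored stencil $u_{j+3}, \dots, u_{j-1}$.

```python
def minmod(x, y):
    """Two-argument minmod: smaller-magnitude argument if signs agree, else 0."""
    if x * y <= 0.0:
        return 0.0
    return x if abs(x) < abs(y) else y


def minmod4(w, x, y, z):
    """Four-argument minmod used for the MP5 curvature measures."""
    if w > 0 and x > 0 and y > 0 and z > 0:
        return min(w, x, y, z)
    if w < 0 and x < 0 and y < 0 and z < 0:
        return max(w, x, y, z)
    return 0.0


def mp5_face(um2, um1, u0, up1, up2, alpha=4.0, eps=1e-10):
    """Limited left state at j+1/2 from u_{j-2}..u_{j+2} (MP5 sketch)."""
    # Unlimited fifth-order interpolant.
    v_or = (2 * um2 - 13 * um1 + 47 * u0 + 27 * up1 - 3 * up2) / 60.0
    # If v_or already lies inside the monotonicity-preserving bound, keep it.
    v_mp = u0 + minmod(up1 - u0, alpha * (u0 - um1))
    if (v_or - u0) * (v_or - v_mp) <= eps:
        return v_or
    # Discrete curvatures and their limited interface values.
    d_m1 = um2 - 2 * um1 + u0
    d_0 = um1 - 2 * u0 + up1
    d_p1 = u0 - 2 * up1 + up2
    dm4_ph = minmod4(4 * d_0 - d_p1, 4 * d_p1 - d_0, d_0, d_p1)
    dm4_mh = minmod4(4 * d_0 - d_m1, 4 * d_m1 - d_0, d_0, d_m1)
    v_ul = u0 + alpha * (u0 - um1)                          # upper limit
    v_md = 0.5 * (u0 + up1) - 0.5 * dm4_ph                  # median
    v_lc = u0 + 0.5 * (u0 - um1) + (4.0 / 3.0) * dm4_mh     # large curvature
    v_min = max(min(u0, up1, v_md), min(u0, v_ul, v_lc))
    v_max = min(max(u0, up1, v_md), max(u0, v_ul, v_lc))
    # Clip v_or into [v_min, v_max].
    return v_or + minmod(v_min - v_or, v_max - v_or)


def mp5_interface_states(u, j):
    """(u^L, u^R) at interface j+1/2; requires u[j-2] .. u[j+3]."""
    left = mp5_face(u[j - 2], u[j - 1], u[j], u[j + 1], u[j + 2])
    right = mp5_face(u[j + 3], u[j + 2], u[j + 1], u[j], u[j - 1])
    return left, right
```

For linear data the unlimited value is exact and passes through untouched (both states equal 2.5 for `u = [0, 1, 2, 3, 4, 5]` at `j = 2`), while at a step the limiter pulls the states back to the adjacent cell values, avoiding overshoots.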

GPU computing

The GPU processes data in a Single-Instruction-Multiple-Thread (SIMT) manner. A function executed on the GPU is called a kernel, and it is launched from the host (CPU). In the CUDA programming model, threads are organized into a grid of thread blocks: a grid consists of multiple thread blocks, and each thread block contains a number of threads. When a kernel is launched, the scheduler on the device automatically distributes the thread blocks among the available streaming multiprocessors (SM
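The grid/block/thread hierarchy can be illustrated without a GPU by emulating, in plain Python, how a kernel that assigns one thread per cell computes its global index. This is not CUDA code: the block size of 256, the one-thread-per-cell mapping, and the names `grid_size`, `emulate_launch`, and `scale_kernel` are assumptions for illustration. The index formula mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`, and the bounds guard is needed because the last block is generally only partly filled.

```python
def grid_size(n_cells, block_dim):
    """Number of thread blocks needed to cover n_cells (ceiling division)."""
    return (n_cells + block_dim - 1) // block_dim


def emulate_launch(kernel, n_cells, block_dim, *args):
    """Run kernel(block_idx, thread_idx, block_dim, *args) for every thread.

    On a GPU the blocks would be distributed across the streaming
    multiprocessors and run concurrently; here they simply run in order.
    """
    for block_idx in range(grid_size(n_cells, block_dim)):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, factor):
    """One thread per cell: global index from block and thread indices."""
    i = block_idx * block_dim + thread_idx
    if i < len(data):  # guard: the last block may be partially filled
        data[i] *= factor


cells = [1.0] * 1000
emulate_launch(scale_kernel, len(cells), 256, cells, 2.0)
```

With 1000 cells and 256 threads per block, four blocks are launched and the last 24 threads of the final block fall outside the array and do nothing; every cell is updated exactly once.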

Implementation

The overall implementation of the code can be divided into two parts: the fluid dynamics and the kinetics. The fluid dynamics module is responsible for the advection calculation. The kinetics module, on the other hand, calculates the species consumption/production due to chemical reactions and ensures that detailed balance is satisfied. The overall flow chart of the program is shown in Fig. 4 (see also Fig. 5 for the flow chart of the fluid dynamics module). After all the flow variables have been
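The split between a fluid dynamics module and a kinetics module can be sketched on a toy problem: two species A and B advected with the same velocity and reacting through A ⇌ B. Several simplifications here are assumptions, not the paper's scheme: first-order upwind transport with periodic boundaries stands in for MP5/RK3, the splitting is first-order (Lie), and the kinetics step is backward Euler with the backward rate set from an equilibrium constant, $k_b = k_f / K_{eq}$, to respect detailed balance. Because each cell's kinetics is independent, this step maps naturally onto one GPU thread per cell.

```python
def upwind_advect(u, cfl):
    """First-order upwind step for u_t + a u_x = 0 (a > 0), periodic boundaries."""
    return [u[i] - cfl * (u[i] - u[i - 1]) for i in range(len(u))]


def implicit_kinetics(a, b, kf, keq, dt):
    """Backward-Euler step for A <=> B with kb = kf / keq (detailed balance).

    The system is linear, so the implicit update is solved exactly;
    a + b is conserved.
    """
    kb = kf / keq
    total = a + b
    a_new = (a + dt * kb * total) / (1.0 + dt * (kf + kb))
    return a_new, total - a_new


def split_step(ya, yb, cfl, kf, keq, dt):
    """One first-order (Lie) split step: transport, then cell-wise kinetics."""
    ya = upwind_advect(ya, cfl)
    yb = upwind_advect(yb, cfl)
    for i in range(len(ya)):  # independent per cell: one GPU thread each
        ya[i], yb[i] = implicit_kinetics(ya[i], yb[i], kf, keq, dt)
    return ya, yb


# Hypothetical stiff setup: kf * dt = 1000.
ya = [1.0 if 10 <= i < 20 else 0.1 for i in range(50)]
yb = [0.0] * 50
for _ in range(20):
    ya, yb = split_step(ya, yb, cfl=0.5, kf=1.0e6, keq=3.0, dt=1.0e-3)
```

With $k_f \Delta t = 1000$, an explicit Euler update of the kinetics would be unstable; the implicit update stays bounded, conserves total mass, and drives every cell to the equilibrium ratio $B/A = K_{eq} = 3$.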

Solver results

The first objective is to verify that the solver is correctly implemented using the CUDA kernels. For this purpose, we compare the results with a pure-CPU version and also compute a set of standard test cases. The first of these is a Mach 3 wind tunnel problem (a.k.a. the forward step problem) computed with the MP5 scheme, whose solution is shown in Fig. 8. This problem was used by Woodward and Colella [14] to test a variety of numerical schemes. The whole domain is initialized with

Conclusion and future work

In the current paper, we described the implementation of a numerical solver for simulating chemically reacting flow on the GPU. The fluid dynamics is modeled using high-order shock-capturing schemes, and the chemical kinetics is solved using an implicit solver. Results for both the fluid dynamics and the chemical kinetics are shown. Considering only the fluid dynamics, we obtained speed-ups of 30 and 55 times over the CPU version for the MP5 and ADERWENO schemes, respectively. For the

Acknowledgment

The authors would like to thank Prof. Ann Karagozian of UCLA for her extensive support in running simulations on the Hoffman2 GPU cluster.

References

E. Elsen et al., Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. (2008).
A. Klöckner et al., Nodal discontinuous Galerkin methods on graphics processors, J. Comput. Phys. (2009).
H.-C. Wong et al., Efficient magnetohydrodynamic simulations on graphics processing units with CUDA, Comput. Phys. Comm. (2011).
A. Suresh et al., Accurate monotonicity-preserving schemes with Runge–Kutta time stepping, J. Comput. Phys. (1997).
V.A. Titarev et al., ADER schemes for three-dimensional nonlinear hyperbolic systems, J. Comput. Phys. (2005).
B. Einfeldt et al., On Godunov-type methods near low densities, J. Comput. Phys. (1991).
P. Woodward et al., The numerical simulation of two-dimensional fluid flow with strong shocks, J. Comput. Phys. (1984).
NVIDIA Corporation, Compute Unified Device Architecture Programming Guide version 4.0,...
T. Brandvik, G. Pullan, Acceleration of a 3D Euler Solver using Commodity Graphics Hardware, in: 46th AIAA Aerospace...
J. Linford et al., Automatic generation of multi-core chemical kernels, IEEE Transactions on Parallel and Distributed Systems (2011).
J. Linford, J. Michalakes, M. Vachharijani, A. Sandu, Multi-core acceleration of chemical kinetics for modeling and...
