Projecting LBM performance on Exascale class Architectures: A tentative outlook

https://doi.org/10.1016/j.jocs.2021.101447

Abstract

In the next few years, the first Exascale-class supercomputers will go online. In this paper, we portray a prospective Exascale scenario for a Lattice Boltzmann (LB) code, based on the performance obtained on an up-to-date Petascale HPC facility. Although extrapolation is always a perilous exercise, in this work we aim to lay down a few guidelines to help the LBM community carefully plan larger and more complex simulations on the forthcoming Exascale supercomputers.

Introduction

“Exascale computing” refers to computational systems capable of delivering at least one Exaflops, i.e. 10¹⁸ floating-point operations per second (Flops), when running the High Performance Linpack benchmark (HPL [1], a library for dense matrix–matrix products) [2].

HPL is currently the most widely adopted tool to rank supercomputers [3]; however, for lattice Boltzmann method (LBM)-based codes, as well as for any computational fluid dynamics (CFD) solver, it is still unclear what performance limit a real application code can actually achieve on such facilities.

Other benchmarks, like HPCG [4], have been introduced to deliver more realistic, “on-field” evaluations as a reference for real-world codes but, up to now, HPL remains the undisputed metric for supercomputer ranking. In recent years, the remarkable improvements in computational performance have made it possible to achieve important milestones in scientific and technological investigations, shedding light on complex phenomena and even pointing to novel paths for future, multi-disciplinary investigations [5]. The purpose of this work is to provide a plausible outlook on the still unknown Exascale-class machines, by extrapolating the computational performance obtained on a Petascale-class facility, namely Marconi100 (ranked 14th in the June 2021 Top500 list [3]), using a single-phase, single-component BGK-LBM code as reference.

For the sake of brevity, issues related to the general workflow, such as I/O and pre/post-processing, are not described in this work. In the following, we will focus on computational performance by considering the classical problem of the 3D lid-driven cavity, for which we have chased the best possible performance in order to set a reference bar for the LBM community. Both single-node and cluster-wide performance figures will be presented.

Besides this reference case, we provide the results of extreme fluid dynamics simulations, beyond the state of the art of current CFD, which have been carried out with the same code. Such simulations have delivered unprecedented insights for biological and engineering investigations [5], [6], [7]. We wish to stress that no fine-tuned optimization, such as assembler coding, has been used for the present benchmarks, in order to ensure the maximum possible portability and to deliver a reasonable performance that can serve as a reference for a broad community of LBM users and developers.

The present work is organized as follows: in Section 1, we briefly describe the main known features of the forthcoming Exascale-class HPC facilities, in terms of number of nodes, accelerators and other prominent characteristics.

In Section 2, a very brief introduction to the Lattice Boltzmann Method is given, with a focus on the computational performance of the algorithm.

Section 3 describes the adopted optimization procedures and proposes performance results, with a particular focus on performance portability.

In Section 4, results for a Petascale-class machine are reported and a tentative extrapolation to the Exascale class is proposed. Finally, in Section 5, some overall conclusions are drawn about the Exascale scenario for LBM methods.

Section snippets

Exascale class supercomputers: main features

This class of supercomputers, originally scheduled for delivery in 2018, is expected to go online within the next two years.

The main limitation for this HPC class is power consumption. With current technology, to keep power dissipation as low as possible, it is mandatory to use accelerators, such as general-purpose GPUs (GPGPUs) or other ad hoc accelerators. Such devices offer a higher Flops-per-Watt ratio compared to standard CPUs: an Exascale facility, in fact,
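As a rough illustration of this power constraint, the required energy efficiency follows from simple arithmetic; the sketch below assumes a power envelope of 20–30 MW, a commonly quoted Exascale target rather than a figure taken from this work.

```python
# Required energy efficiency for an Exascale machine, assuming a power
# envelope of 20-30 MW (a commonly quoted target, used here only for
# illustration; the actual figure is machine specific).
exaflops = 1e18  # floating-point operations per second

for power_mw in (20, 30):
    gflops_per_watt = exaflops / (power_mw * 1e6) / 1e9
    print(f"{power_mw} MW envelope -> {gflops_per_watt:.0f} GFlops/W required")
```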

LBM in a nutshell

The LB method was developed in the late 1980s as an evolution of lattice gas cellular automata for fluid dynamics simulations [14]. In the following 30 years, LBM has seen an impressive growth of applications across a remarkably broad spectrum of complex flow problems, from fully developed turbulence to micro- and nanofluidics [15], [16], all the way down to quark–gluon plasmas [17], [18].

The main idea is to solve a minimal Boltzmann kinetic equation for a set of discrete distribution
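To give a flavour of the discrete-distribution idea sketched above, the following is a minimal, purely illustrative D2Q9 BGK collide-and-stream step written in Python/NumPy; it is not the production code used in this work, and all names and parameters are chosen here only for illustration.

```python
import numpy as np

# D2Q9 lattice: discrete velocities and weights (standard values)
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, u):
    """Second-order BGK equilibrium for the D2Q9 lattice."""
    cu = np.einsum('qd,dxy->qxy', c, u)        # c_i . u
    usq = np.einsum('dxy,dxy->xy', u, u)       # |u|^2
    return rho * w[:, None, None] * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * usq)

def bgk_step(f, tau):
    """One collide-and-stream update of the distributions f[q, x, y]."""
    rho = f.sum(axis=0)                              # density
    u = np.einsum('qd,qxy->dxy', c, f) / rho         # velocity
    f = f - (f - equilibrium(rho, u)) / tau          # BGK collision
    for q in range(9):                               # periodic streaming
        f[q] = np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
    return f

# Tiny usage example: 64x64 lattice initialized at rest, tau = 0.6
f = equilibrium(np.ones((64, 64)), np.zeros((2, 64, 64)))
f = bgk_step(f, tau=0.6)
```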

LBM performance

The metric used to assess computational performance is Mega Lattice Updates per second (MLUPs). For problems whose computational size is large enough to fill all cache levels, MLUP figures are independent of CPU architecture, language and problem characteristics, thus providing a fair comparison between different codes for similar test cases. Conversion to MFlops can easily be obtained from the number of floating-point operations performed per gridpoint.
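As a small worked example of this metric, the snippet below computes MLUPs from problem size and elapsed time and converts to MFlops; the per-gridpoint operation count of about 250 is an assumed, typical order of magnitude for a D3Q19 BGK kernel, not a value taken from this work.

```python
def mlups(nx, ny, nz, iterations, elapsed_seconds):
    """Mega Lattice Updates per second: gridpoint updates per second / 1e6."""
    return nx * ny * nz * iterations / elapsed_seconds / 1e6

def mflops(mlups_value, flops_per_gridpoint):
    """Convert MLUPs to MFlops, given the per-gridpoint operation count."""
    return mlups_value * flops_per_gridpoint

# Illustrative example: a 256^3 lattice advanced for 1000 steps in 50 s,
# assuming ~250 floating-point operations per gridpoint update (a typical
# order of magnitude for a D3Q19 BGK kernel, used here as an assumption).
perf = mlups(256, 256, 256, 1000, 50.0)
print(f"{perf:.1f} MLUPs  ~  {mflops(perf, 250) / 1e3:.1f} GFlops")
```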

All the results reported in the

Results for a Petascale-class supercomputer and extrapolation to Exascale

Table 3 reports the weak scaling obtained on the Marconi100 machine [25]. For each node, equipped with four V100 GPUs, a fixed domain of 740³ gridpoints was used, with single-precision computation. A sustained performance of 2.0 PFlops has been achieved on a 29 PFlops machine, i.e. 6.9% of peak performance, quite far from the 70% that HPL can achieve, confirming the dramatic difference in performance between the HPL benchmark and a real CFD code.
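The sustained-to-peak fraction quoted above, and the kind of extrapolation attempted in this work, amount to simple arithmetic; the sketch below reproduces the 6.9% figure from the numbers in the text and then, under the (strong) assumption that the same fraction carries over unchanged, scales it to a nominal 1 EFlops machine.

```python
# Sustained vs. peak performance on Marconi100 (values quoted in the text)
sustained_pflops = 2.0     # measured sustained performance, PFlops
peak_pflops = 29.0         # machine peak, PFlops

efficiency = sustained_pflops / peak_pflops
print(f"Sustained fraction of peak: {efficiency:.1%}")        # ~6.9%

# Naive extrapolation: assume the same fraction holds on a 1 EFlops machine
# (an illustrative assumption; real scaling may well degrade further).
exascale_peak_pflops = 1000.0
print(f"Projected sustained performance: "
      f"{efficiency * exascale_peak_pflops:.0f} PFlops")
```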

In [28], [29], the performance of different LBM codes on

Conclusions

Starting from a real LBM code, the realistic performance attainable on an Exascale-class machine has been estimated. Today, slightly more than 5% of peak performance is the actually achievable limit, far below the fraction of peak reached by HPL, the benchmark used to rank supercomputers.

Similar single-node performance is reached on four different architectures; together with the Roofline predictions for both V100 and A100, this suggests that we are not far from reaching the limit for a CFD code with an arithmetic
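For reference, Roofline predictions follow the standard model, attainable performance = min(peak, arithmetic intensity × memory bandwidth); the sketch below uses approximate datasheet-level V100 and A100 figures and an illustrative arithmetic intensity of 1 Flop/byte, all of which are assumptions rather than the exact values used in this work.

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable performance (GFlops) under the classic Roofline model."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)

# Approximate single-precision peak and memory bandwidth (datasheet-level
# figures, used here only for illustration).
gpus = {
    "V100": {"peak_gflops": 15_700, "bandwidth_gbs": 900},
    "A100": {"peak_gflops": 19_500, "bandwidth_gbs": 1_555},
}

ai = 1.0  # illustrative arithmetic intensity (Flops/byte) of an LBM kernel
for name, spec in gpus.items():
    attainable = roofline(spec["peak_gflops"], spec["bandwidth_gbs"], ai)
    print(f"{name}: ~{attainable:.0f} GFlops attainable at {ai} Flop/byte")
```

At such a low arithmetic intensity the attainable performance is set by memory bandwidth rather than by peak Flops, which is consistent with the few-percent-of-peak figures discussed above.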

Authors’ contribution

G.F. and S.S. coordinated the research; G.A. developed the advanced LBM code with G.F.; G.A. extended the code for massively parallel and GPU computation and ran the simulations on the different HPC architectures; G.F., V.K.K. and G.A. interpreted the results; P.F. designed the E. aspergillum models; G.A., G.F. and S.S. wrote the manuscript. All authors reviewed and accepted the paper.

Conflict of interest

None.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgments

G.A. wishes to thank Ingolf Staerk and all Fujitsu staff for access to the A64FX CPU and for their support, and Vittorio Ruggiero for the support in digging for the right compiler options. S.S. wishes to acknowledge financial support from the European Research Council under the Horizon 2020 Programme Advanced Grant Agreement no. 739964 (“COPMAT”). G.F. wishes to acknowledge the CINECA Computational Grant ISCRA-B IsB17 – “SPONGES”, id. HP10B9ZOKQ and, partially, the support of PRIN Project CUP

Giorgio Amati received his degree in Physics from the University of Rome “La Sapienza” (1994) and his Ph.D. in Fluid Dynamics from the University of Rome “La Sapienza” (1998). From 11/1998 to 2/2012 he was a technology officer at the CASPUR supercomputing center, Rome. From 2/2012 to 7/2013 he held a post-doc position on an ERC grant at the University of Rome “Tor Vergata”. Since 7/2013 he has been providing second-level support, currently involved in CFD, HPC benchmarking and technology scouting.

References (29)



The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
