Projecting LBM performance on Exascale class Architectures: A tentative outlook

https://doi.org/10.1016/j.jocs.2021.101447

Abstract

In the next few years, the first Exascale-class supercomputers will go online. In this paper, we portray a prospective Exascale scenario for a Lattice Boltzmann (LB) code, based on the performance obtained on an up-to-date Petascale HPC facility. Although extrapolation is always a perilous exercise, in this work we aim to lay down a few guidelines to help the LBM community carefully plan larger and more complex simulations on the forthcoming Exascale supercomputers.

Introduction

“Exascale computing” refers to computational systems capable of delivering at least one Exaflops, i.e. 10¹⁸ floating-point operations per second (Flops), when running the High Performance Linpack benchmark (HPL [1], a library for dense matrix–matrix products) [2].

HPL is currently the most widely adopted tool to rank supercomputers [3]; however, for lattice Boltzmann method (LBM)-based codes, as well as for any computational fluid dynamics (CFD) solver, it is still unclear what performance limit a real application code can actually achieve on such facilities.

Other benchmarks, like HPCG [4], have been introduced to deliver more realistic, “on-field” evaluations as a reference for real-world codes but, up to now, HPL remains the undisputed metric for supercomputer ranking. In recent years, the remarkable improvements in computational performance have made it possible to achieve important milestones in scientific and technological investigations, shedding light on complex phenomena and even pointing to novel paths for future, multi-disciplinary investigations [5]. The purpose of this work is to provide a plausible outlook on the still unknown Exascale-class machines, by extrapolating the computational performance obtained on a Petascale-class facility, namely Marconi100 (ranked 14th in the June 2021 Top500 list [3]), using a single-phase, single-component BGK-LBM code as reference.

For the sake of brevity, issues related to the general workflow, such as I/O and pre/post-processing, are not described in this work. In the following, we will focus on computational performance by considering the classical problem of the 3D lid-driven cavity, for which we have chased the best possible performance in order to set a reference bar for the LBM community. Both single-node and cluster-wide performance figures will be presented.

Besides this reference case, we provide the results of extreme fluid dynamics simulations, beyond the state of the art of current CFD, which have been carried out with the same code. Such simulations have delivered unprecedented insights for biological and engineering investigations [5], [6], [7]. We wish to stress that no fine-tuned optimization, such as assembler coding, has been used for the present benchmarks, in order to ensure the maximum possible portability and to deliver a reasonable performance that can serve as a reference for a broad community of LBM users and developers.

The present work is organized as follows: in Section 1, we briefly describe the main known features of the forthcoming Exascale-class HPC facilities, in terms of number of nodes, accelerators and other prominent characteristics.

In Section 2, a very brief introduction to the Lattice Boltzmann Method is given, with a focus on the computational performance of the algorithm.

Section 3 describes the adopted optimization procedures and proposes performance results, with a particular focus on performance portability.

In Section 4, results for a Petascale-class machine are reported and a tentative extrapolation to the Exascale class is proposed. Finally, in Section 5, some overall conclusions are drawn about the Exascale scenario for LBM methods.

Section snippets

Exascale class supercomputers: main features

This class of supercomputers, originally scheduled for delivery in 2018, is expected to go online within the next two years.

The main limitation for this HPC class is power consumption. With current technology, to keep power dissipation as low as possible, it is mandatory to use accelerators, such as general-purpose GPUs (GPGPUs) or other ad hoc accelerators. Such devices offer a higher Flops-per-Watt ratio compared to standard CPUs: an Exascale facility, in fact,
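As a rough illustration of this power constraint, the required energy efficiency follows from simple arithmetic; the sketch below assumes a power envelope of 20–30 MW, a commonly quoted Exascale target rather than a figure taken from this work.

```python
# Required energy efficiency for an Exascale machine, assuming a power
# envelope of 20-30 MW (a commonly quoted target, used here only for
# illustration; the actual figure is machine specific).
exaflops = 1e18  # floating-point operations per second

for power_mw in (20, 30):
    gflops_per_watt = exaflops / (power_mw * 1e6) / 1e9
    print(f"{power_mw} MW envelope -> {gflops_per_watt:.0f} GFlops/W required")
```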

LBM in a nutshell

The LB method was developed in the late 1980s as an evolution of lattice gas cellular automata for fluid dynamics simulations [14]. In the following 30 years, LBM has seen an impressive growth of applications across a remarkably broad spectrum of complex flow problems, from fully developed turbulence to micro- and nanofluidics [15], [16], all the way down to quark–gluon plasmas [17], [18].

The main idea is to solve a minimal Boltzmann kinetic equation for a set of discrete distribution
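To give a flavour of the discrete-distribution idea sketched above, the following is a minimal, purely illustrative D2Q9 BGK collide-and-stream step written in Python/NumPy; it is not the production code used in this work, and all names and parameters are chosen here only for illustration.

```python
import numpy as np

# D2Q9 lattice: discrete velocities and weights (standard values)
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, u):
    """Second-order BGK equilibrium for the D2Q9 lattice."""
    cu = np.einsum('qd,dxy->qxy', c, u)        # c_i . u
    usq = np.einsum('dxy,dxy->xy', u, u)       # |u|^2
    return rho * w[:, None, None] * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * usq)

def bgk_step(f, tau):
    """One collide-and-stream update of the distributions f[q, x, y]."""
    rho = f.sum(axis=0)                              # density
    u = np.einsum('qd,qxy->dxy', c, f) / rho         # velocity
    f = f - (f - equilibrium(rho, u)) / tau          # BGK collision
    for q in range(9):                               # periodic streaming
        f[q] = np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
    return f

# Tiny usage example: 64x64 lattice initialized at rest, tau = 0.6
f = equilibrium(np.ones((64, 64)), np.zeros((2, 64, 64)))
f = bgk_step(f, tau=0.6)
```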

LBM performance

The metric used to assess computational performance is Mega Lattice Updates per second (MLUPs). For problems whose computational size is large enough to fill all cache levels, MLUP figures are independent of CPU architecture, language and problem characteristics, thus providing a fair comparison between different codes for similar test cases. Conversion to MFlops can easily be obtained from the number of floating-point operations performed per gridpoint.
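As a small worked example of this metric, the snippet below computes MLUPs from problem size and elapsed time and converts to MFlops; the per-gridpoint operation count of about 250 is an assumed, typical order of magnitude for a D3Q19 BGK kernel, not a value taken from this work.

```python
def mlups(nx, ny, nz, iterations, elapsed_seconds):
    """Mega Lattice Updates per second: gridpoint updates per second / 1e6."""
    return nx * ny * nz * iterations / elapsed_seconds / 1e6

def mflops(mlups_value, flops_per_gridpoint):
    """Convert MLUPs to MFlops, given the per-gridpoint operation count."""
    return mlups_value * flops_per_gridpoint

# Illustrative example: a 256^3 lattice advanced for 1000 steps in 50 s,
# assuming ~250 floating-point operations per gridpoint update (a typical
# order of magnitude for a D3Q19 BGK kernel, used here as an assumption).
perf = mlups(256, 256, 256, 1000, 50.0)
print(f"{perf:.1f} MLUPs  ~  {mflops(perf, 250) / 1e3:.1f} GFlops")
```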

All the results reported in the

Results for a Petascale-class supercomputer and extrapolation to Exascale

Table 3 reports the weak scaling obtained on the Marconi100 machine [25]. For each node, equipped with four V100 GPUs, a fixed domain of 740³ gridpoints was used, with single-precision computation. A sustained performance of 2.0 PFlops has been achieved on a 29 PFlops machine, i.e. 6.9% of peak performance, quite far from the 70% that HPL can achieve, confirming the dramatic difference in performance between the HPL benchmark and a real CFD code.
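The sustained-to-peak fraction quoted above, and the kind of extrapolation attempted in this work, amount to simple arithmetic; the sketch below reproduces the 6.9% figure from the numbers in the text and then, under the (strong) assumption that the same fraction carries over unchanged, scales it to a nominal 1 EFlops machine.

```python
# Sustained vs. peak performance on Marconi100 (values quoted in the text)
sustained_pflops = 2.0     # measured sustained performance, PFlops
peak_pflops = 29.0         # machine peak, PFlops

efficiency = sustained_pflops / peak_pflops
print(f"Sustained fraction of peak: {efficiency:.1%}")        # ~6.9%

# Naive extrapolation: assume the same fraction holds on a 1 EFlops machine
# (an illustrative assumption; real scaling may well degrade further).
exascale_peak_pflops = 1000.0
print(f"Projected sustained performance: "
      f"{efficiency * exascale_peak_pflops:.0f} PFlops")
```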

In [28], [29], the performance of different LBM codes on

Conclusions

Starting from a real LBM code, the realistic performance attainable on an Exascale-class machine has been estimated. Today, slightly more than 5% of peak performance is the actually achievable limit, far below the fraction of peak reached by HPL, the benchmark used to rank supercomputers.

Similar single-node performance is reached on four different architectures; together with the Roofline predictions for both V100 and A100, this suggests that we are not far from reaching the limit for a CFD code with an arithmetic
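For reference, Roofline predictions follow the standard model, attainable performance = min(peak, arithmetic intensity × memory bandwidth); the sketch below uses approximate datasheet-level V100 and A100 figures and an illustrative arithmetic intensity of 1 Flop/byte, all of which are assumptions rather than the exact values used in this work.

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable performance (GFlops) under the classic Roofline model."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)

# Approximate single-precision peak and memory bandwidth (datasheet-level
# figures, used here only for illustration).
gpus = {
    "V100": {"peak_gflops": 15_700, "bandwidth_gbs": 900},
    "A100": {"peak_gflops": 19_500, "bandwidth_gbs": 1_555},
}

ai = 1.0  # illustrative arithmetic intensity (Flops/byte) of an LBM kernel
for name, spec in gpus.items():
    attainable = roofline(spec["peak_gflops"], spec["bandwidth_gbs"], ai)
    print(f"{name}: ~{attainable:.0f} GFlops attainable at {ai} Flop/byte")
```

At such a low arithmetic intensity the attainable performance is set by memory bandwidth rather than by peak Flops, which is consistent with the few-percent-of-peak figures discussed above.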

Authors’ contribution

G.F. and S.S. coordinated the research; G.A. developed the advanced LBM code with G.F.; G.A. extended the code for massively parallel and GPU computation and ran the simulations on the different HPC architectures; G.F., V.K.K. and G.A. interpreted the results; P.F. designed the E. aspergillum models; G.A., G.F. and S.S. wrote the manuscript. All authors reviewed and accepted the paper.

Conflict of interest

None.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgments

G.A. wishes to thank Ingolf Staerk and all Fujitsu staff for access to the A64FX CPU and for their support, and Vittorio Ruggiero for the support in digging for the right compiler options. S.S. wishes to acknowledge financial support from the European Research Council under the Horizon 2020 Programme Advanced Grant Agreement no. 739964 (“COPMAT”). G.F. wishes to acknowledge the CINECA Computational Grant ISCRA-B IsB17 – “SPONGES”, id. HP10B9ZOKQ and, partially, the support of PRIN Project CUP

Giorgio Amati received his degree in Physics from the University of Rome “La Sapienza” (1994) and his Ph.D. in Fluid Dynamics from the University of Rome “La Sapienza” (1998). From 11/1998 to 2/2012 he was a technology officer at the CASPUR supercomputing center, Rome. From 2/2012 to 7/2013 he held a post-doc position on an ERC grant at the University of Rome “Tor Vergata”. Since 7/2013 he has been providing second-level support, currently involved in CFD, HPC benchmarking and technology scouting.

References (29)



The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
