Streaming techniques: revealing the natural concurrency of the lattice Boltzmann method

Zakirov, Andrey; Perepelkina, Anastasia; Levchenko, Vadim; Khilkov, Sergey

doi:10.1007/s11227-021-03762-z

Published: 31 March 2021

The Journal of Supercomputing, volume 77, pages 11911–11929 (2021)

Andrey Zakirov¹,
Anastasia Perepelkina² (ORCID: orcid.org/0000-0003-2517-6064),
Vadim Levchenko² &
Sergey Khilkov³

Abstract

The lattice Boltzmann method (LBM) produces stencil numerical schemes that fall into the memory-bound domain, so performance may be multiplied if the arithmetic intensity is increased. In this paper, we explore the possible arrangements of the data flow at the streaming step, with the aim of developing the most efficient algorithms and implementations of LBM schemes. For this purpose we use the locally recursive non-locally asynchronous (LRnLA) algorithm construction method, which is based on the analysis of the dependency graph of the task in dD1T Minkowski space. The schemes of well-known propagation patterns are illustrated and analyzed. With knowledge of their advantages and drawbacks, a new propagation scheme is defined. The best propagation scheme constructed with this method is implemented as GPU code. A description of the code and the performance results are provided. The obtained performance is up to 10 GLUps on a single NVIDIA GeForce RTX 3090 GPU.
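
The memory-bound argument can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only: the function name, the D3Q19 lattice, single precision, the one-read-one-write traffic model, and the nominal 936 GB/s bandwidth are all assumptions made here, not values taken from the paper.

```python
def memory_bound_glups(bandwidth_gb_s, q, bytes_per_value, steps_per_pass=1):
    """Upper bound on giga lattice updates per second for a memory-bound code.

    Assumed traffic model (hypothetical, for illustration): each of the q
    values of a node is read once and written once per pass over memory,
    and `steps_per_pass` LBM steps are performed on the data during that pass.
    """
    bytes_per_pass = 2 * q * bytes_per_value
    # GB/s divided by bytes per update gives giga-updates per second.
    return bandwidth_gb_s * steps_per_pass / bytes_per_pass


# Assumed example values: D3Q19, float32, 936 GB/s nominal bandwidth.
one_step = memory_bound_glups(936.0, q=19, bytes_per_value=4)
two_steps = memory_bound_glups(936.0, q=19, bytes_per_value=4, steps_per_pass=2)
print(f"one step per pass:  {one_step:.2f} GLUps")
print(f"two steps per pass: {two_steps:.2f} GLUps")
```

Under this simple model the ceiling grows in proportion to the number of time steps performed per pass over memory, which is the sense in which a higher arithmetic intensity can multiply performance.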

Author information

Authors and Affiliations

Kintech Lab Ltd, 12 3rd Khoroshevskaya str, Moscow, Russia, 123298
Andrey Zakirov
Keldysh Institute of Applied Mathematics, 4 Miusskaya sq, Moscow, Russia, 125047
Anastasia Perepelkina & Vadim Levchenko
Hipercone Ltd, 12 3rd Khoroshevskaya str, Moscow, Russia, 123298
Sergey Khilkov

Corresponding author

Correspondence to Anastasia Perepelkina.

Additional information

The work is supported by the Russian Science Foundation, Grant #18-71-10004.

Appendix

To avoid confusion, we list here some terms that are used similarly in some contexts but have distinct meanings in the present text.

Lattice node (also: LBM node): a node defined in the LBM. It is identified by its coordinate \({\mathbf {x}}_i\).
DG node: a dependency graph node. It is an elementary operation or several merged elementary operations. It may be assigned a position in the dD1T space (d spatial dimensions plus time).
DG spatial cell: arises when the DG is subdivided first into tiers, where one tier is one LBM step, and then into the operations associated with one lattice node (Figs. 2, 3). It contains several DG nodes and represents the process of executing the operations in these nodes.
LRnLA cell: arises when the DG is subdivided according to the LRnLA (locally recursive non-locally asynchronous) rules (Fig. 4). It contains several DG nodes and represents the process of executing the operations in these nodes.
Data cell: defined for a specific LBM implementation. It is often the set of values updated in one DG spatial cell. In the current work, data cells are organized in a d-dimensional array, and each contains a full set of PDF values, which are not necessarily associated with the same time instant or the same LBM node.

One LBM update for a lattice node is the execution of Eqs. (1) and (2) at that node. In the DG it is represented by one collision node and Q streaming operation nodes. In code, one LBM update for a lattice node is an update of Q values, which may be stored in one data cell. In this work, an LBM update for a group of \(2^d\) LBM nodes results in an update of \(2^d\) data cells. Thus, when the number of lattice cell updates per second is measured (the GLUps metric, in billions of updates per second), it may be counted equivalently as LBM nodes processed, as DG cells executed, or as data cells updated.
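
A minimal sketch of this counting is given below. Eqs. (1) and (2) are not reproduced in this text, so the sketch assumes they are the usual collision and streaming steps. The D1Q3 lattice, the BGK collision, the periodic boundary, and all function names are assumptions chosen for brevity, not the implementation described in the paper.

```python
# D1Q3 lattice (assumed for illustration): velocities and standard weights.
C = (0, 1, -1)
W = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)
Q = len(C)


def collide(f, omega):
    """One collision node: BGK relaxation of the Q values of one lattice node."""
    rho = sum(f)
    u = sum(c * fi for c, fi in zip(C, f)) / rho
    feq = [w * rho * (1.0 + 3.0 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
           for c, w in zip(C, W)]
    return [fi - omega * (fi - fe) for fi, fe in zip(f, feq)]


def lbm_step(cells, omega):
    """One LBM step on a periodic 1D lattice; cells[x] is the data cell of node x."""
    n = len(cells)
    post = [collide(f, omega) for f in cells]      # one collision per node
    new = [[0.0] * Q for _ in range(n)]
    for x in range(n):
        for i, c in enumerate(C):                  # Q streaming operations per node
            new[(x + c) % n][i] = post[x][i]
    return new


def glups(num_nodes, steps, seconds):
    """Giga lattice updates per second: node updates performed per second / 1e9."""
    return num_nodes * steps / seconds / 1e9
```

Here each call to `lbm_step` performs one update per lattice node and rewrites one data cell of Q values per node, so counting nodes processed or data cells updated gives the same number.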

Additionally, we use the term ‘streaming scheme’ when describing the data flow between operations (a functional programming view) and ‘algorithm’ when describing specific rules for data storage and access (an imperative programming view).
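
The distinction can be illustrated with a small sketch in which one data flow is realized by two different storage rules. The D1Q3 lattice, the periodic boundary, and both function names are assumptions made for this illustration. Neither function is claimed to be one of the algorithms analyzed in the paper.

```python
# D1Q3 velocities (assumed for illustration).
C = (0, 1, -1)


def stream_two_arrays(f):
    """Storage rule A: read from one array and write into a second array."""
    n = len(f)
    out = [[0.0] * len(C) for _ in range(n)]
    for x in range(n):
        for i, c in enumerate(C):
            out[(x + c) % n][i] = f[x][i]
    return out


def stream_in_place(f):
    """Storage rule B: the same data flow, but values move inside one array."""
    n = len(f)
    # Direction +1: sweep right to left so no value is overwritten before it is read.
    wrapped = f[n - 1][1]
    for x in range(n - 1, 0, -1):
        f[x][1] = f[x - 1][1]
    f[0][1] = wrapped
    # Direction -1: sweep left to right for the same reason.
    wrapped = f[0][2]
    for x in range(n - 1):
        f[x][2] = f[x + 1][2]
    f[n - 1][2] = wrapped
    return f
```

Both functions move the value of direction i at node x to node x + c_i, so they realize the same streaming scheme. They differ only in where the values are stored and in what order they are accessed, which is what the term ‘algorithm’ refers to above.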

About this article

Cite this article

Zakirov, A., Perepelkina, A., Levchenko, V. et al. Streaming techniques: revealing the natural concurrency of the lattice Boltzmann method. J Supercomput 77, 11911–11929 (2021). https://doi.org/10.1007/s11227-021-03762-z

Accepted: 18 March 2021
Published: 31 March 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s11227-021-03762-z
