Extreme-scale Direct Numerical Simulation of Incompressible Turbulence on the Heterogeneous Many-core System

Research article · PPoPP '24 · DOI: 10.1145/3627535.3638479

Published: 20 February 2024

Abstract

Direct numerical simulation (DNS) directly solves the fluid Navier-Stokes equations at high spatial and temporal resolution, and has driven much research into the nature of turbulence. For incompressible turbulence at high Reynolds number (Re), the regime of particular interest, where the nondimensional Re characterizes the flow, the application of DNS is hindered by the fact that the numerical grid size (i.e., the memory requirement) scales as Re^3, while the overall computational cost scales as Re^4. Recent studies have shown that developing efficient parallel methods for heterogeneous many-core systems is a promising way to meet this computational challenge.
We develop PowerLLEL++, a high-performance, scalable implicit finite-difference solver for heterogeneous many-core systems, to accelerate the extreme-scale DNS of incompressible turbulence. To achieve this goal, an adaptive multi-level parallelization strategy is first proposed to fully exploit the multi-level parallelism and computing power of heterogeneous many-core systems. Second, a hierarchical-memory-adapted data reuse/tiling strategy and kernel fusion are adopted to improve the performance of memory-bound stencil-like operations. Third, a parallel tridiagonal solver based on the parallel diagonal dominant (PDD) algorithm is developed to minimize the number of global data transposes. Fourth, three effective communication optimizations are implemented with Remote Direct Memory Access (RDMA) to maximize the performance of the remaining global transposes and the halo exchange.
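The kernel-fusion idea for memory-bound stencils can be illustrated with a minimal 1-D sketch (ours, not the PowerLLEL++ code; the explicit-diffusion update and function names are illustrative assumptions). Instead of materializing an intermediate array between two passes, the fused variant consumes each stencil value as soon as it is produced, roughly halving main-memory traffic:

```python
import numpy as np

def step_unfused(u, dt):
    """Two kernels: write the full Laplacian to memory, then read it back."""
    lap = np.empty_like(u)
    lap[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2]   # pass 1: stencil sweep
    lap[0] = lap[-1] = 0.0                        # keep boundaries fixed
    return u + dt * lap                           # pass 2: full-array update

def step_fused(u, dt):
    """One kernel: the stencil value is consumed immediately, so the
    intermediate array never round-trips through main memory."""
    out = u.copy()
    out[1:-1] += dt * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return out
```

Both variants produce the same result; on a memory-bound accelerator the fused form moves roughly half the data. NumPy is used only to keep the sketch short; in a real solver the fusion happens inside the compiled compute kernels.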
Results show that the solver exploits the heterogeneous computing power of the new Tianhe supercomputer, achieving a speedup of up to 10.6× over the CPU-only performance. Linear strong scaling is obtained at grid sizes of up to 25.8 billion points.
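As a rough illustration of why the PDD-based tridiagonal solver needs only neighbor communication instead of global transposes, the following serial sketch (our reconstruction for a constant-band, diagonally dominant system; all names are ours) partitions the system into p blocks, solves each block locally, and closes the interfaces with independent 2×2 solves. In a distributed setting each interface solve would exchange two scalars with one neighbor:

```python
import numpy as np

def tridiag(n, a, b, c):
    """Dense tridiagonal matrix with constant bands (demo only)."""
    return (np.diag(np.full(n, b))
            + np.diag(np.full(n - 1, a), -1)
            + np.diag(np.full(n - 1, c), 1))

def pdd_solve(n, p, a, b, c, d):
    """Approximate PDD solve of a diagonally dominant tridiagonal system,
    simulating p "ranks" serially. Assumes p divides n."""
    m = n // p
    xt, v, w = [], [], []
    for i in range(p):
        Ai = tridiag(m, a, b, c)
        e1 = np.zeros(m); e1[0] = a    # coupling to the left block's last row
        em = np.zeros(m); em[-1] = c   # coupling to the right block's first row
        sol = np.linalg.solve(Ai, np.column_stack([d[i*m:(i+1)*m], e1, em]))
        xt.append(sol[:, 0]); v.append(sol[:, 1]); w.append(sol[:, 2])
    # Interface 2x2 solves: PDD drops the negligible tips v[i+1][-1] and
    # w[i][0], which decay across a block for diagonally dominant systems,
    # so each interface decouples and needs only neighbor data.
    beta = np.zeros(p)    # corrected first entry of each block
    gamma = np.zeros(p)   # corrected last entry of each block
    for i in range(p - 1):
        denom = 1.0 - v[i+1][0] * w[i][-1]
        beta[i+1] = (xt[i+1][0] - v[i+1][0] * xt[i][-1]) / denom
        gamma[i] = xt[i][-1] - w[i][-1] * beta[i+1]
    # Local correction: x_i = x~_i - v_i * gamma_{i-1} - w_i * beta_{i+1}
    x = np.empty(n)
    for i in range(p):
        xi = xt[i].copy()
        if i > 0:
            xi -= v[i] * gamma[i-1]
        if i < p - 1:
            xi -= w[i] * beta[i+1]
        x[i*m:(i+1)*m] = xi
    return x
```

For a strongly diagonally dominant system (e.g., bands −1, 4, −1) the dropped interface terms are far below round-off at realistic block sizes, so the PDD answer matches a direct solve to high accuracy while avoiding any all-to-all data movement.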


Cited By

  • Optimizing Stencil Computation on Multi-core DSPs. In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24), 679–690. DOI: 10.1145/3673038.3673062 (12 Aug 2024)
  • UNR: Unified Notifiable RMA Library for HPC. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–15. DOI: 10.1109/SC41406.2024.00111 (17 Nov 2024)
  • Pipe-AGCM: A Fine-Grain Pipelining Scheme for Optimizing the Parallel Atmospheric General Circulation Model. In Euro-Par 2024: Parallel Processing, 283–297. DOI: 10.1007/978-3-031-69583-4_20 (26 Aug 2024)


Published In

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
March 2024
498 pages
ISBN:9798400704352
DOI:10.1145/3627535

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. incompressible turbulence
  2. direct numerical simulation
  3. heterogeneous
  4. many-core
  5. high performance computing


Funding Sources

  • The National Key Research and Development Program of China
  • The Major Program of Guangdong Basic and Applied Research
  • Guangdong Province Special Support Program for Cultivating High-Level Talents

Conference

PPoPP '24. Overall acceptance rate: 230 of 1,014 submissions (23%).

