Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators

The Journal of Supercomputing

Abstract

Cosmological N-body simulation demands hyper-scale, high-resolution computing, and its run time grows steeply as the problem size increases, which has always been the central concern in N-body problems. To meet the growing demand for computing scale, high-performance computing systems and efficient parallel algorithms have been applied to the N-body problem. PHotoNs-2, a parallel N-body simulation code designed for Lambda cold dark matter (ΛCDM) simulations, was developed using a hybrid of the fast multipole method (FMM) and the particle-mesh (PM) method. In this study, PHotoNs-2 is ported to a heterogeneous parallel CPU+accelerator platform, referred to as PHotoNs-MA, whose performance is challenged by massive data transfers, memory access patterns, and complex mathematical functions. This paper presents the main optimizations of the short-range force kernels on the SIMT architecture: transferring bulk data through page-locked memory and adopting a structure-of-arrays layout to improve memory access efficiency, transmitting index lists instead of particle interaction lists to reduce transfer overhead, and replacing the modified interaction-force formula with an interpolation method. Compared with PHotoNs-2 running on 4 CPU cores, the optimized PHotoNs-MA on 4 accelerators accelerates the P2P operator by about 1000x. Compared with Gadget-2 running on 64 CPU cores, overall performance on 4 accelerators improves by a factor of 6. For large-scale simulations, the P2P operator exhibits near-linear scalability, and the parallel efficiency ultimately reaches 89.28%.
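To make the listed optimizations concrete, the minimal CUDA sketch below (not taken from PHotoNs-MA; every name, the table resolution, and the force-splitting scale are illustrative assumptions) stages particle data in a structure-of-arrays layout inside page-locked host buffers, transfers it asynchronously, and evaluates the short-range P2P force by linear interpolation from a precomputed table rather than calling erfcf()/expf() for every particle pair, in the spirit of the paper's interpolation optimization.

    /* Minimal CUDA sketch, not the authors' code: page-locked (pinned) host
     * buffers in a structure-of-arrays layout, asynchronous transfers, and a
     * P2P kernel that interpolates the short-range force factor from a
     * precomputed table instead of calling erfcf()/expf() per particle pair.
     * All names, the table resolution, and the splitting scale are assumptions. */
    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE 1024
    #define RCUT 1.0f                 /* assumed short-range cutoff radius */

    __constant__ float d_rcut;

    __global__ void p2pKernel(const float *x, const float *y, const float *z,
                              const float *m, const float *table,
                              float *ax, float *ay, float *az, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        /* SoA layout: thread i loads element i of each array, so accesses
         * within a warp are contiguous and coalesced. */
        float xi = x[i], yi = y[i], zi = z[i];
        float fx = 0.f, fy = 0.f, fz = 0.f;
        for (int j = 0; j < n; ++j) {             /* brute-force pair loop */
            float dx = x[j] - xi, dy = y[j] - yi, dz = z[j] - zi;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 == 0.f) continue;              /* skip self-interaction */
            float r = sqrtf(r2);
            if (r >= d_rcut) continue;
            /* Linear interpolation replaces the transcendental cutoff formula;
             * r < d_rcut guarantees k + 1 stays inside the table. */
            float t = r / d_rcut * (TABLE_SIZE - 1);
            int   k = (int)t;
            float w = t - (float)k;
            float g = table[k] + w * (table[k + 1] - table[k]);
            float f = m[j] * g / (r2 * r);        /* g folds in the cutoff */
            fx += f * dx; fy += f * dy; fz += f * dz;
        }
        ax[i] = fx; ay[i] = fy; az[i] = fz;
    }

    int main(void)
    {
        const int n = 1 << 14;
        float *hx, *hy, *hz, *hm, *htab;
        /* Page-locked buffers let cudaMemcpyAsync overlap with computation. */
        cudaMallocHost((void **)&hx, n * sizeof(float));
        cudaMallocHost((void **)&hy, n * sizeof(float));
        cudaMallocHost((void **)&hz, n * sizeof(float));
        cudaMallocHost((void **)&hm, n * sizeof(float));
        cudaMallocHost((void **)&htab, TABLE_SIZE * sizeof(float));
        for (int i = 0; i < n; ++i) {
            hx[i] = (float)rand() / RAND_MAX;
            hy[i] = (float)rand() / RAND_MAX;
            hz[i] = (float)rand() / RAND_MAX;
            hm[i] = 1.0f;
        }
        /* Sample a TreePM-style short-range factor once on the host
         * (Gadget-like erfc splitting; the scale rs is an assumption). */
        const float rs = 0.25f * RCUT;
        for (int k = 0; k < TABLE_SIZE; ++k) {
            float r = RCUT * k / (TABLE_SIZE - 1.0f);
            htab[k] = erfcf(r / (2.f * rs))
                    + r / (rs * sqrtf(3.14159265f)) * expf(-r * r / (4.f * rs * rs));
        }
        float *dx, *dy, *dz, *dm, *dtab, *dax, *day, *daz;
        cudaMalloc((void **)&dx, n * sizeof(float));
        cudaMalloc((void **)&dy, n * sizeof(float));
        cudaMalloc((void **)&dz, n * sizeof(float));
        cudaMalloc((void **)&dm, n * sizeof(float));
        cudaMalloc((void **)&dax, n * sizeof(float));
        cudaMalloc((void **)&day, n * sizeof(float));
        cudaMalloc((void **)&daz, n * sizeof(float));
        cudaMalloc((void **)&dtab, TABLE_SIZE * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dz, hz, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dm, hm, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dtab, htab, TABLE_SIZE * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        float rcut = RCUT;
        cudaMemcpyToSymbol(d_rcut, &rcut, sizeof(float));
        p2pKernel<<<(n + 255) / 256, 256, 0, stream>>>(dx, dy, dz, dm, dtab,
                                                       dax, day, daz, n);
        cudaStreamSynchronize(stream);
        printf("P2P kernel: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }

With the SoA layout, a warp's global loads are contiguous and coalesce into few memory transactions, and because page-locked memory backs the cudaMemcpyAsync calls, transfers can overlap with kernel execution on other streams; sampling the cutoff factor once on the host trades a small amount of accuracy for removing two transcendental function calls from the inner pair loop.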





Acknowledgements

This work was supported by the National Key R&D Program for Developing Basic Sciences (Grant No. 2020YFB0204802), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC01000000), and GHFUND A (No. 20210701). The numerical calculations in this paper were carried out on the CAS Xiandao-1 computing environment.

Author information


Corresponding author

Correspondence to Wu Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, WL., Wang, W. & Wang, Q. Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators. J Supercomput 78, 7186–7205 (2022). https://doi.org/10.1007/s11227-021-04153-0

