Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators

The Journal of Supercomputing

Abstract

Cosmological N-body simulation demands hyper-scale, high-resolution computing, and its run time grows steeply as the problem size increases, which has always been the central concern in N-body problems. To meet the growing demand for computing scale, high-performance computing systems and efficient parallel algorithms have been applied to the N-body problem. PHotoNs-2, a parallel N-body simulation code designed for Lambda cold dark matter (ΛCDM) simulations, was developed using a hybrid of the fast multipole method (FMM) and the particle-mesh (PM) method. In this study, PHotoNs-2 is ported to a heterogeneous parallel CPU+accelerator platform, referred to as PHotoNs-MA, whose performance is challenged by massive data transfers, memory access patterns, and complex mathematical functions. This paper presents the main optimizations of the short-range force kernels on the SIMT architecture: transferring bulk data through page-locked memory and adopting a structure-of-arrays layout to improve memory access efficiency, transmitting index lists instead of particle interaction lists to reduce transfer overhead, and replacing the modified interaction-force formula with an interpolation method. Compared with PHotoNs-2 running on 4 CPU cores, the optimized PHotoNs-MA on 4 accelerators accelerates the P2P operator by about 1000x. Compared with Gadget-2 running on 64 CPU cores, overall performance on 4 accelerators improves by a factor of 6. For large-scale simulations, the P2P operator exhibits near-linear scalability, and the parallel efficiency ultimately reaches 89.28%.
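To make the listed optimizations concrete, the minimal CUDA sketch below (not taken from PHotoNs-MA; every name, the table resolution, and the force-splitting scale are illustrative assumptions) stages particle data in a structure-of-arrays layout inside page-locked host buffers, transfers it asynchronously, and evaluates the short-range P2P force by linear interpolation from a precomputed table rather than calling erfcf()/expf() for every particle pair, in the spirit of the paper's interpolation optimization.

    /* Minimal CUDA sketch, not the authors' code: page-locked (pinned) host
     * buffers in a structure-of-arrays layout, asynchronous transfers, and a
     * P2P kernel that interpolates the short-range force factor from a
     * precomputed table instead of calling erfcf()/expf() per particle pair.
     * All names, the table resolution, and the splitting scale are assumptions. */
    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE 1024
    #define RCUT 1.0f                 /* assumed short-range cutoff radius */

    __constant__ float d_rcut;

    __global__ void p2pKernel(const float *x, const float *y, const float *z,
                              const float *m, const float *table,
                              float *ax, float *ay, float *az, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        /* SoA layout: thread i loads element i of each array, so accesses
         * within a warp are contiguous and coalesced. */
        float xi = x[i], yi = y[i], zi = z[i];
        float fx = 0.f, fy = 0.f, fz = 0.f;
        for (int j = 0; j < n; ++j) {             /* brute-force pair loop */
            float dx = x[j] - xi, dy = y[j] - yi, dz = z[j] - zi;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 == 0.f) continue;              /* skip self-interaction */
            float r = sqrtf(r2);
            if (r >= d_rcut) continue;
            /* Linear interpolation replaces the transcendental cutoff formula;
             * r < d_rcut guarantees k + 1 stays inside the table. */
            float t = r / d_rcut * (TABLE_SIZE - 1);
            int   k = (int)t;
            float w = t - (float)k;
            float g = table[k] + w * (table[k + 1] - table[k]);
            float f = m[j] * g / (r2 * r);        /* g folds in the cutoff */
            fx += f * dx; fy += f * dy; fz += f * dz;
        }
        ax[i] = fx; ay[i] = fy; az[i] = fz;
    }

    int main(void)
    {
        const int n = 1 << 14;
        float *hx, *hy, *hz, *hm, *htab;
        /* Page-locked buffers let cudaMemcpyAsync overlap with computation. */
        cudaMallocHost((void **)&hx, n * sizeof(float));
        cudaMallocHost((void **)&hy, n * sizeof(float));
        cudaMallocHost((void **)&hz, n * sizeof(float));
        cudaMallocHost((void **)&hm, n * sizeof(float));
        cudaMallocHost((void **)&htab, TABLE_SIZE * sizeof(float));
        for (int i = 0; i < n; ++i) {
            hx[i] = (float)rand() / RAND_MAX;
            hy[i] = (float)rand() / RAND_MAX;
            hz[i] = (float)rand() / RAND_MAX;
            hm[i] = 1.0f;
        }
        /* Sample a TreePM-style short-range factor once on the host
         * (Gadget-like erfc splitting; the scale rs is an assumption). */
        const float rs = 0.25f * RCUT;
        for (int k = 0; k < TABLE_SIZE; ++k) {
            float r = RCUT * k / (TABLE_SIZE - 1.0f);
            htab[k] = erfcf(r / (2.f * rs))
                    + r / (rs * sqrtf(3.14159265f)) * expf(-r * r / (4.f * rs * rs));
        }
        float *dx, *dy, *dz, *dm, *dtab, *dax, *day, *daz;
        cudaMalloc((void **)&dx, n * sizeof(float));
        cudaMalloc((void **)&dy, n * sizeof(float));
        cudaMalloc((void **)&dz, n * sizeof(float));
        cudaMalloc((void **)&dm, n * sizeof(float));
        cudaMalloc((void **)&dax, n * sizeof(float));
        cudaMalloc((void **)&day, n * sizeof(float));
        cudaMalloc((void **)&daz, n * sizeof(float));
        cudaMalloc((void **)&dtab, TABLE_SIZE * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dz, hz, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dm, hm, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dtab, htab, TABLE_SIZE * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        float rcut = RCUT;
        cudaMemcpyToSymbol(d_rcut, &rcut, sizeof(float));
        p2pKernel<<<(n + 255) / 256, 256, 0, stream>>>(dx, dy, dz, dm, dtab,
                                                       dax, day, daz, n);
        cudaStreamSynchronize(stream);
        printf("P2P kernel: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }

With the SoA layout, a warp's global loads are contiguous and coalesce into few memory transactions, and because page-locked memory backs the cudaMemcpyAsync calls, transfers can overlap with kernel execution on other streams; sampling the cutoff factor once on the host trades a small amount of accuracy for removing two transcendental function calls from the inner pair loop.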





Acknowledgements

This work was supported by the National Key R&D Program for Developing Basic Sciences (Grant No. 2020YFB0204802), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC01000000), and GHFUND A (No. 20210701). The numerical calculations in this paper were carried out on the CAS Xiandao-1 computing environment.

Author information


Corresponding author

Correspondence to Wu Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, WL., Wang, W. & Wang, Q. Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators. J Supercomput 78, 7186–7205 (2022). https://doi.org/10.1007/s11227-021-04153-0

