
A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation

Special Issue Paper
Computer Science – Research and Development

Abstract

Recently, general-purpose computation on graphics processing units (GPGPU) has become an increasingly popular field of study, as graphics processing units (GPUs) continue to be proposed as high-performance, relatively low-cost platforms for scientific computing applications. Among these applications are astrophysical N-body simulations, which form one of the most challenging problems in computational science. However, most reported studies used a simple \( \mathcal{O}(N^{2})\) algorithm on GPUs, and the resulting performance was not better than that of conventional CPUs running more optimized \( \mathcal{O}(N \log N)\) algorithms such as the tree algorithm or the particle-particle particle-mesh algorithm. Because such algorithms are difficult to implement efficiently on GPUs, a GPU cluster had no practical advantage over a general-purpose PC cluster for N-body simulations. In this paper, we report a new method for the efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows an N-body simulation to run on a GPU cluster at much higher performance than on a general-purpose PC cluster. We performed a cosmological simulation with 562 million particles on a GPU cluster of 128 NVIDIA GeForce 8800GTS GPUs at an overall cost of $168,172. We obtained a sustained performance of 20.1 Tflops which, when normalized against a general-purpose CPU implementation, corresponds to 8.50 Tflops. The achieved cost/performance was hence a mere $19.8/Gflops, which demonstrates the high competitiveness of GPGPU.
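For concreteness, the following is a minimal CUDA sketch of the simple \( \mathcal{O}(N^{2})\) direct-summation approach the abstract refers to. It is illustrative only, not the authors' code; the kernel name, launch configuration, and softening value are hypothetical. One thread computes the softened gravitational acceleration on one particle by looping over all N sources:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// One thread per target particle i; each thread loops over all N
// sources, so total work is O(N^2). eps2 is the Plummer softening
// length squared, which keeps the j == i self-term finite (zero).
__global__ void direct_nbody(const float4 *pos, float4 *acc, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float ax = 0.0f, ay = 0.0f, az = 0.0f;
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];                    // pj.w carries the mass m_j
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx*dx + dy*dy + dz*dz + eps2;
        float rinv = rsqrtf(r2);
        float s = pj.w * rinv * rinv * rinv;   // m_j / r^3
        ax += s * dx;  ay += s * dy;  az += s * dz;
    }
    acc[i] = make_float4(ax, ay, az, 0.0f);
}

int main()
{
    const int n = 4096;
    std::vector<float4> h(n);
    for (int i = 0; i < n; ++i)                // equal-mass particles in a unit cube
        h[i] = make_float4(rand() / (float)RAND_MAX,
                           rand() / (float)RAND_MAX,
                           rand() / (float)RAND_MAX,
                           1.0f / n);
    float4 *d_pos, *d_acc;
    cudaMalloc(&d_pos, n * sizeof(float4));
    cudaMalloc(&d_acc, n * sizeof(float4));
    cudaMemcpy(d_pos, h.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    direct_nbody<<<(n + 255) / 256, 256>>>(d_pos, d_acc, n, 1.0e-4f);
    cudaMemcpy(h.data(), d_acc, n * sizeof(float4), cudaMemcpyDeviceToHost);
    printf("acc[0] = (%g, %g, %g)\n", h[0].x, h[0].y, h[0].z);
    cudaFree(d_pos);
    cudaFree(d_acc);
    return 0;
}

The tree algorithm discussed in the paper replaces that inner loop over all N sources with a walk down the Barnes–Hut octree: groups of distant particles whose cell satisfies an opening-angle criterion are approximated by a single center-of-mass interaction, reducing the total work to \( \mathcal{O}(N \log N)\). Performing many such walks efficiently in parallel on the GPU is the core of the proposed method.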



Author information

Correspondence to Tsuyoshi Hamada.


About this article

Cite this article

Hamada, T., Nitadori, K., Benkrid, K. et al. A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation. Comp. Sci. Res. Dev. 24, 21–31 (2009). https://doi.org/10.1007/s00450-009-0089-1

