Abstract
General-purpose computation on graphics processing units (GPGPU) has become an increasingly popular field of study, as graphics processing units (GPUs) continue to be proposed as high-performance, relatively low-cost platforms for scientific computing. One such application is the astrophysical N-body simulation, one of the most challenging problems in computational science. In most reported studies, however, a simple \( \mathcal{O}(N^{2})\) algorithm was used on GPUs, and the resulting performance was no better than that of conventional CPUs running more optimized \( \mathcal{O}(N \log N)\) algorithms such as the tree algorithm or the particle-particle particle-mesh algorithm. Because such algorithms are difficult to implement efficiently on GPUs, a GPU cluster had no practical advantage over general-purpose PC clusters for N-body simulations. In this paper, we report a new method for efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows an N-body simulation to run on a GPU cluster at much higher performance than on general-purpose PC clusters. We performed a cosmological simulation with 562 million particles on a GPU cluster of 128 NVIDIA GeForce 8800GTS GPUs with an overall cost of $168,172. We obtained a sustained performance of 20.1 Tflops, which corresponds to 8.50 Tflops when normalized against a general-purpose CPU implementation. The achieved cost/performance was hence a mere $19.8/Gflops, which shows the high competitiveness of GPGPUs.
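The cost/performance figure can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the paper: it assumes the quoted value is the total cluster cost divided by the normalized (CPU-equivalent) performance, and the function name `cost_per_gflops` is a hypothetical helper introduced here for illustration.

```python
# Figures quoted in the abstract.
TOTAL_COST_USD = 168_172    # overall cost of the 128-GPU cluster
SUSTAINED_TFLOPS = 20.1     # sustained performance as measured on the GPUs
NORMALIZED_TFLOPS = 8.50    # performance normalized against a CPU implementation


def cost_per_gflops(cost_usd: float, tflops: float) -> float:
    """Hypothetical helper: dollars per Gflops (1 Tflops = 1000 Gflops)."""
    return cost_usd / (tflops * 1000.0)


# Assumption: the abstract's figure uses the normalized performance.
normalized_ratio = cost_per_gflops(TOTAL_COST_USD, NORMALIZED_TFLOPS)
# For comparison only: the same cost divided by the raw sustained figure.
raw_ratio = cost_per_gflops(TOTAL_COST_USD, SUSTAINED_TFLOPS)

print(f"normalized: ${normalized_ratio:.1f}/Gflops")  # normalized: $19.8/Gflops
print(f"raw:        ${raw_ratio:.1f}/Gflops")         # raw:        $8.4/Gflops
```

Dividing by the normalized 8.50 Tflops reproduces the quoted $19.8/Gflops. Dividing by the raw 20.1 Tflops would give a lower figure, so the quoted number is the more conservative of the two.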
Cite this article
Hamada, T., Nitadori, K., Benkrid, K. et al. A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation. Comp. Sci. Res. Dev. 24, 21–31 (2009). https://doi.org/10.1007/s00450-009-0089-1
DOI: https://doi.org/10.1007/s00450-009-0089-1