A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation

Hamada, Tsuyoshi; Nitadori, Keigo; Benkrid, Khaled; Ohno, Yousuke; Morimoto, Gentaro; Masada, Tomonari; Shibata, Yuichiro; Oguri, Kiyoshi; Taiji, Makoto

doi:10.1007/s00450-009-0089-1

Special Issue Paper
Published: 20 May 2009

Volume 24, pages 21–31 (2009)
Computer Science - Research and Development

Tsuyoshi Hamada¹,
Keigo Nitadori²,
Khaled Benkrid³,
Yousuke Ohno⁴,
Gentaro Morimoto⁴,
Tomonari Masada¹,
Yuichiro Shibata¹,
Kiyoshi Oguri¹ &
Makoto Taiji⁴

Abstract

General-purpose computation on graphics processing units (GPGPU) has recently become an increasingly popular field of study, as graphics processing units (GPUs) continue to be proposed as high-performance, relatively low-cost platforms for scientific computing. Astrophysical N-body simulations, among the most challenging problems in computational science, are one such application. However, most reported GPGPU studies used a simple $\mathcal{O}(N^{2})$ algorithm, and the resulting performance was no better than that of conventional CPUs running more optimized $\mathcal{O}(N \log N)$ algorithms such as the tree algorithm or the particle-particle particle-mesh algorithm. Because such algorithms are difficult to implement efficiently on GPUs, a GPU cluster had no practical advantage over general-purpose PC clusters for N-body simulations. In this paper, we report a new method for efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows an N-body simulation to run on a GPU cluster at much higher performance than on general-purpose PC clusters. We performed a cosmological simulation with 562 million particles on a GPU cluster of 128 NVIDIA GeForce 8800GTS GPUs, at an overall cost of USD 168,172. We obtained a sustained performance of 20.1 Tflops, which corresponds to 8.50 Tflops when normalized against a general-purpose CPU implementation. The achieved cost/performance was hence a mere USD 19.8 per Gflops, which shows the high competitiveness of GPGPUs.
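The cost/performance figure in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the interaction counts are order-of-magnitude estimates that ignore all constant factors, and nothing here reproduces the authors' multiple-walk algorithm.

```python
import math

# Figures quoted in the abstract.
N_PARTICLES = 562_000_000     # particles in the cosmological run
TOTAL_COST_USD = 168_172      # overall cost of the 128-GPU cluster
NORMALIZED_TFLOPS = 8.50      # performance normalized against a CPU code


def usd_per_gflops(cost_usd: float, tflops: float) -> float:
    """Price/performance in dollars per Gflops (1 Tflops = 1000 Gflops)."""
    return cost_usd / (tflops * 1000.0)


def interactions_direct(n: int) -> float:
    """Rough interaction count for direct summation, on the order of N^2."""
    return float(n) * float(n)


def interactions_tree(n: int) -> float:
    """Rough interaction count for a tree code, on the order of N log2 N.

    Assumption: constant factors (which depend on the opening criterion
    and the implementation) are ignored, so only the scale is meaningful.
    """
    return n * math.log2(n)


if __name__ == "__main__":
    price = usd_per_gflops(TOTAL_COST_USD, NORMALIZED_TFLOPS)
    print(f"cost/performance: {price:.1f} USD per Gflops")

    ratio = interactions_direct(N_PARTICLES) / interactions_tree(N_PARTICLES)
    print(f"N^2 / (N log2 N) for N = {N_PARTICLES}: about {ratio:.1e}")
```

Dividing the cluster cost by the normalized 8.50 Tflops, rather than the raw 20.1 Tflops, is what yields the quoted 19.8 USD per Gflops. The second figure only indicates why an $\mathcal{O}(N \log N)$ method matters at this particle count; it is not a measured speedup.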

Author information

Authors and Affiliations

Faculty of Engineering, Department of Computer and Information Sciences, Nagasaki University, Bunkyo-machi, 852-8521, Nagasaki, Japan
Tsuyoshi Hamada, Tomonari Masada, Yuichiro Shibata & Kiyoshi Oguri
Department of Astronomy, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
Keigo Nitadori
School of Engineering, The University of Edinburgh, King’s Buildings, Mayfield Road, EH9 3JL, Edinburgh, Scotland, UK
Khaled Benkrid
RIKEN (The Institute of Physical and Chemical Research), 61-1 Ono, Tsurumi, Yokohama, 230-0046, Kanagawa, Japan
Yousuke Ohno, Gentaro Morimoto & Makoto Taiji

Corresponding author

Correspondence to Tsuyoshi Hamada.

About this article

Cite this article

Hamada, T., Nitadori, K., Benkrid, K. et al. A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation. Comp. Sci. Res. Dev. 24, 21–31 (2009). https://doi.org/10.1007/s00450-009-0089-1

Published: 20 May 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s00450-009-0089-1
