DOI: 10.1145/3337801.3337813

Scaling Performance for N-Body Stream Computation with a Ring of FPGAs

Published: 06 June 2019

ABSTRACT

Field-Programmable Gate Arrays (FPGAs) offer a comparatively non-invasive way to build custom architectures specialized for a particular application domain. Recent studies have demonstrated that single-node FPGAs can rival both CPUs and GPUs in performance. Unfortunately, most existing studies limit themselves to a single FPGA device, and the scalability of such designs requires further investigation.

In this work, we demonstrate in practice how to scale the important n-body problem across a comparatively large FPGA cluster. Our design, composed of up to 256 processing elements, achieves near-linear strong scaling, with performance comparable to that of custom Application-Specific Integrated Circuits (ASICs). We further develop an analytical performance model, which we use to predict the performance of our solution on upcoming Intel Agilex systems. Today, our system reaches up to 47 Giga-Pairs/second, and our performance model predicts a peak of up to 0.142 Tera-Pairs/second on next-generation FPGAs.
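The pairs-per-second figures above admit a simple back-of-the-envelope check. For a ring of fully pipelined processing elements, aggregate throughput is roughly the number of elements times the clock frequency, assuming each element retires one particle-pair interaction per cycle; the Python sketch below illustrates this arithmetic. The one-pair-per-cycle assumption and the implied clock frequency are illustrative and are not taken from the paper, which develops its own analytical performance model.

    # Back-of-the-envelope throughput estimate for a ring of fully pipelined
    # n-body processing elements (PEs). Illustrative sketch only, not the
    # authors' analytical model: it assumes each PE retires one particle-pair
    # interaction per clock cycle and that the ring keeps every PE busy.

    def pairs_per_second(num_pes: int, clock_hz: float, pairs_per_cycle: float = 1.0) -> float:
        """Aggregate pair-interaction throughput of num_pes pipelined PEs."""
        return num_pes * clock_hz * pairs_per_cycle

    def time_per_step(num_particles: int, throughput: float) -> float:
        """Wall-clock time for one all-pairs (N^2) force evaluation."""
        return num_particles ** 2 / throughput

    if __name__ == "__main__":
        reported = 47e9                 # 47 Giga-pairs/s, as reported in the abstract
        num_pes = 256                   # processing elements, as reported in the abstract
        clock = reported / num_pes      # implied clock under the one-pair-per-cycle assumption
        print(f"implied clock: {clock / 1e6:.0f} MHz")
        print(f"throughput: {pairs_per_second(num_pes, clock) / 1e9:.1f} Gpairs/s")
        # Time for one direct-summation step with one million particles at that rate:
        print(f"1M-particle step: {time_per_step(1_000_000, reported):.1f} s")

Under the same assumption, reaching the predicted 0.142 Tera-Pairs/second would require roughly a threefold increase in pipeline count, clock rate, or some combination of the two.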


Published in

HEART '19: Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies
June 2019, 106 pages
ISBN: 9781450372558
DOI: 10.1145/3337801

Copyright © 2019 ACM. Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 6 June 2019

Acceptance Rates

HEART '19 paper acceptance rate: 12 of 29 submissions, 41%. Overall acceptance rate: 22 of 50 submissions, 44%.
