short-paper

Hardware Specialization: Estimating Monte Carlo Cross-Section Lookup Kernel Performance and Area

Authors:

Kazutomo Yoshii,

Pete Beckman

SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis

Pages 1274–1278

https://doi.org/10.1145/3624062.3625534

Published: 12 November 2023

Abstract

Hardware specialization is one of the promising directions in the post-Moore era, and it is imperative to understand how hardware specialization paradigms can benefit HPC. An essential question is how to estimate the theoretical performance of an optimally specialized architecture without requiring extensive hardware development expertise and effort. Focusing on the Monte Carlo cross-section lookup kernel, known for its notably low resource utilization, we develop a workflow that leverages open-source hardware tools to simulate a specialized architecture's timing and estimate its resource usage. We implement building blocks of the kernel pipeline in the Chisel hardware construction language and generate Verilog code for resource estimation. Our late-breaking results show a kernel latency of 46 cycles per lookup, whereas the optimized CPU code takes 680 cycles, and a potential 15k pipeline copies within a 698 mm² die, a size reflective of the Intel Xeon Platinum 8180.
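
The headline figures in the abstract can be related with simple arithmetic. The Python sketch below derives the cycle ratio between the CPU code and the specialized pipeline, and the average die area available per pipeline copy. It is a back-of-envelope illustration only: reading "15k" as exactly 15,000 and dividing the die evenly among copies are assumptions made here, the function names are hypothetical, and the cycle ratio should not be read as a speedup, since the abstract does not state clock rates or pipelined throughput.

```python
# Back-of-envelope figures derived only from numbers quoted in the abstract.
KERNEL_CYCLES_PER_LOOKUP = 46   # specialized pipeline latency (abstract)
CPU_CYCLES_PER_LOOKUP = 680     # optimized CPU code (abstract)
DIE_AREA_MM2 = 698.0            # die area quoted in the abstract
PIPELINE_COPIES = 15_000        # assumption: "15k" read as exactly 15,000


def latency_ratio(cpu_cycles: float, kernel_cycles: float) -> float:
    """Ratio of CPU cycles to specialized-pipeline cycles per lookup."""
    return cpu_cycles / kernel_cycles


def area_per_copy_mm2(die_area_mm2: float, copies: int) -> float:
    """Average area per pipeline copy, assuming the die is split evenly."""
    return die_area_mm2 / copies


ratio = latency_ratio(CPU_CYCLES_PER_LOOKUP, KERNEL_CYCLES_PER_LOOKUP)
area = area_per_copy_mm2(DIE_AREA_MM2, PIPELINE_COPIES)

print(f"cycle ratio (CPU / pipeline): {ratio:.1f}x")   # about 14.8x
print(f"average area per copy: {area:.4f} mm^2")       # about 0.0465 mm^2
```

Under these assumptions, a lookup takes roughly 14.8 times fewer cycles on the specialized pipeline, and each copy would occupy on the order of 0.05 mm² if the whole die were devoted to pipelines; any area needed for memory, interconnect, or I/O would reduce the number of copies that fit.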

Supplemental Material

MP4 File

Recording of "Hardware Specialization: Estimating Monte Carlo Cross-Section Lookup Kernel Performance and Area" presentation at PMBS23.



Published In

SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis

November 2023

2180 pages

ISBN: 9798400707858

DOI: 10.1145/3624062

Copyright © 2023 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

SC-W 2023

SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis

November 12 - 17, 2023

Denver, CO, USA
