From RTL to CUDA: A GPU Acceleration Flow for RTL Simulation with Batch Stimulus

Authors:

Brucek Khailany,

Tsung-Wei Huang

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing

Article No.: 88, Pages 1-12

https://doi.org/10.1145/3545008.3545091

Published: 13 January 2023

Abstract

High-throughput RTL simulation is critical for verifying today’s highly complex SoCs. Recent research has explored accelerating RTL simulation by leveraging event-driven approaches or partitioning heuristics to speed up simulation on a single stimulus. To further improve throughput, industry-quality functional verification signoff must explore running multiple stimuli (i.e., batch stimulus) simultaneously, with either directed tests or random inputs. In this paper, we propose RTLflow, a GPU-accelerated RTL simulation flow with batch stimulus. RTLflow first transpiles RTL into CUDA kernels, each of which simulates one partition of the RTL across multiple stimuli simultaneously. It also leverages CUDA Graph and pipeline scheduling for efficient runtime execution. In experiments on a large industrial design (NVDLA) with 65536 stimuli, RTLflow running on a single A6000 GPU achieves a 40× runtime speed-up over an 80-thread multi-core CPU baseline.
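
As a rough illustration of the batch-stimulus idea in the abstract, the following C++ sketch emulates on the CPU what one transpiled kernel might do: every thread index advances the same RTL partition by one clock cycle for its own stimulus. The 8-bit accumulator design, the `BatchState` struct, and the functions `eval_partition` and `simulate_cycle` are hypothetical names chosen for this example. They are not taken from RTLflow, and the plain loop merely stands in for a CUDA kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical RTL partition used only for illustration:
//   always @(posedge clk) if (en) acc <= acc + in;
// Each stimulus owns one copy of the register. Copies are stored
// contiguously so that neighbouring GPU threads would touch
// neighbouring addresses.
struct BatchState {
    std::vector<std::uint8_t> acc;  // one 8-bit register per stimulus
    explicit BatchState(std::size_t num_stimulus) : acc(num_stimulus, 0) {}
};

// Work done by a single "thread": advance the partition by one clock
// cycle for stimulus number `tid`. Arithmetic wraps at 8 bits, as the
// register would in hardware.
inline void eval_partition(std::uint8_t* acc, const std::uint8_t* in,
                           const std::uint8_t* en, std::size_t tid) {
    if (en[tid] != 0) {
        acc[tid] = static_cast<std::uint8_t>(acc[tid] + in[tid]);
    }
}

// CPU stand-in for a kernel launch: the loop index plays the role of
// the thread index, so all stimuli advance by the same cycle.
void simulate_cycle(BatchState& state, const std::vector<std::uint8_t>& in,
                    const std::vector<std::uint8_t>& en) {
    assert(in.size() == state.acc.size() && en.size() == state.acc.size());
    for (std::size_t tid = 0; tid < state.acc.size(); ++tid) {
        eval_partition(state.acc.data(), in.data(), en.data(), tid);
    }
}
```

Calling `simulate_cycle` once per clock cycle on a `BatchState` sized to the number of stimuli advances all of them in lockstep. In a GPU version of this sketch, the loop body would presumably become the kernel body, with `tid` derived from the block and thread indices.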

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing

August 2022

976 pages

ISBN: 9781450397339

DOI: 10.1145/3545008

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

ICPP '22: 51st International Conference on Parallel Processing

August 29 - September 1, 2022

Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Bibliometrics

Total Citations: 16

Total Downloads: 2,100

Downloads (Last 12 months): 1,284

Downloads (Last 6 weeks): 192

Reflects downloads up to 17 Feb 2025
