research-article

Predictable GPU Wavefront Splitting for Safety-Critical Systems

Authors:

Artem Klashtorny,

Anirudh Mohan Kaushik,

Hiren Patel

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s

Article No.: 107, Pages 1 - 25

https://doi.org/10.1145/3609102

Published: 09 September 2023

Abstract

We present predictable wavefront splitting (PWS), a technique for graphics processing units (GPUs). PWS improves the performance of GPU applications by reducing the impact of branch divergence while ensuring that worst-case execution time (WCET) estimates can still be computed. This makes PWS appropriate for safety-critical applications that require strict temporal guarantees, such as autonomous driving, avionics, and space systems. In developing PWS on an AMD-based GPU, we propose microarchitectural enhancements to the GPU and a compiler pass that eliminates branch serializations to reduce the WCET of a wavefront. Our analysis shows that PWS improves performance by 11% over existing architectures and yields a lower WCET than prior work on wavefront splitting.
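As background for the abstract's mention of branch serialization, the following is a minimal sketch of why a divergent branch inflates the execution time of a wavefront. It assumes a simplified SIMT model in which each side of an if/else runs in turn whenever at least one lane takes it. The function names and cycle counts are hypothetical illustrations, and the sketch does not model the PWS mechanism itself.

```python
def serialized_cycles(mask, then_cycles, else_cycles):
    """Cycles for one wavefront to execute an if/else under serialization.

    mask[i] is True when lane i takes the 'then' path (hypothetical model).
    Each path with at least one active lane is executed in turn.
    """
    cycles = 0
    if any(mask):        # some lane takes the 'then' path
        cycles += then_cycles
    if not all(mask):    # some lane takes the 'else' path
        cycles += else_cycles
    return cycles


def worst_case_cycles(then_cycles, else_cycles):
    # Without knowledge of the mask, a safe bound assumes both paths run.
    return then_cycles + else_cycles


# Example with a 64-lane wavefront and assumed path costs of 10 and 20 cycles.
uniform = serialized_cycles([True] * 64, 10, 20)
divergent = serialized_cycles([True] * 32 + [False] * 32, 10, 20)
```

In this model, a uniform branch costs only the path that is taken, while a divergent one costs the sum of both paths, which is also the bound a static analysis has to assume when the mask is unknown.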

Index Terms

Predictable GPU Wavefront Splitting for Safety-Critical Systems
1. Computer systems organization
  1. Real-time systems
    1. Real-time system architecture

Published In

ACM Transactions on Embedded Computing Systems Volume 22, Issue 5s

Special Issue ESWEEK 2023

October 2023

1394 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3614235

Editor:
Tulika Mitra
National University of Singapore, Singapore

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 09 September 2023

Accepted: 30 June 2023

Revised: 02 June 2023

Received: 23 March 2023

Published in TECS Volume 22, Issue 5s
