ABSTRACT
The performance of graphics processing unit (GPU) workloads can be sensitive to the various clock domains that are dynamically tunable in modern GPUs. In this work, we observe that GPU application performance is sensitive to network-on-chip (NoC) clock frequency, and that this sensitivity varies during the execution of GPU kernels. Traditional dynamic voltage and frequency scaling (DVFS) techniques do not adapt well to this heterogeneity. To that end, we introduce DUB, a <u>D</u>ynamic <u>U</u>nderclocking and <u>B</u>ypassing technique for such heterogeneous GPU workloads. DUB bypasses re-timer flops and routers while underclocking the NoC, enabling high power savings at minimal performance loss. Compared to the baseline, we observe a 26% improvement in power savings with only 3% performance degradation, beating oracular DVFS techniques.
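The core idea above, picking a NoC operating point (frequency plus router/re-timer bypass) based on how NoC-sensitive the current kernel phase is, can be sketched as a simple control policy. This is an illustrative sketch only: the function name, the stall-fraction proxy for sensitivity, and all thresholds and frequencies are assumptions for exposition, not the paper's actual mechanism or values.

```python
from dataclasses import dataclass


@dataclass
class NocConfig:
    freq_mhz: int   # NoC clock frequency to apply for the next epoch
    bypass: bool    # whether to bypass re-timer flops and routers


def choose_noc_config(noc_stall_fraction: float) -> NocConfig:
    """Pick a NoC operating point from a simple sensitivity proxy.

    noc_stall_fraction: fraction of cycles the current kernel phase
    stalls waiting on the NoC (0.0 = insensitive to NoC frequency,
    1.0 = fully NoC-bound). Thresholds below are illustrative.
    """
    if noc_stall_fraction < 0.1:
        # Compute-bound phase: underclock aggressively; bypassing
        # routers/re-timers keeps per-hop latency low despite the
        # slow clock, so performance loss stays small.
        return NocConfig(freq_mhz=500, bypass=True)
    elif noc_stall_fraction < 0.4:
        # Moderately NoC-sensitive phase: milder underclock, still bypassed.
        return NocConfig(freq_mhz=1000, bypass=True)
    else:
        # NoC-bound phase: full frequency, normal pipelined routing.
        return NocConfig(freq_mhz=2000, bypass=False)


# Example: a compute-bound phase gets the low-power bypassed configuration.
cfg = choose_noc_config(0.05)
print(cfg.freq_mhz, cfg.bypass)
```

The point of coupling bypass with underclocking, as the abstract describes, is that a slower NoC clock alone would hurt NoC-sensitive phases; skipping router pipeline stages and re-timer flops compensates for the longer cycle time during low-sensitivity phases.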