ABSTRACT
The performance of graphics processing unit (GPU) workloads can be sensitive to the various clock domains that are dynamically tunable in modern GPUs. In this work, we observe that GPU application performance is sensitive to network-on-chip (NoC) clock frequency, and that this sensitivity varies during the execution of GPU kernels. Traditional dynamic voltage and frequency scaling (DVFS) techniques do not adapt well to this heterogeneity. To that end, we introduce DUB, a <u>D</u>ynamic <u>U</u>nderclocking and <u>B</u>ypassing technique for such heterogeneous GPU workloads. DUB bypasses re-timer flops and routers while underclocking the NoC, enabling high power savings at minimal performance loss. Compared to the baseline, we observe a 26% improvement in power savings with only 3% performance degradation, beating oracular DVFS techniques.
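The core idea above, picking a NoC operating point (frequency plus router/re-timer bypass) based on how NoC-sensitive the current kernel phase is, can be sketched as a simple control policy. This is an illustrative sketch only: the function name, the stall-fraction proxy for sensitivity, and all thresholds and frequencies are assumptions for exposition, not the paper's actual mechanism or values.

```python
from dataclasses import dataclass


@dataclass
class NocConfig:
    freq_mhz: int   # NoC clock frequency to apply for the next epoch
    bypass: bool    # whether to bypass re-timer flops and routers


def choose_noc_config(noc_stall_fraction: float) -> NocConfig:
    """Pick a NoC operating point from a simple sensitivity proxy.

    noc_stall_fraction: fraction of cycles the current kernel phase
    stalls waiting on the NoC (0.0 = insensitive to NoC frequency,
    1.0 = fully NoC-bound). Thresholds below are illustrative.
    """
    if noc_stall_fraction < 0.1:
        # Compute-bound phase: underclock aggressively; bypassing
        # routers/re-timers keeps per-hop latency low despite the
        # slow clock, so performance loss stays small.
        return NocConfig(freq_mhz=500, bypass=True)
    elif noc_stall_fraction < 0.4:
        # Moderately NoC-sensitive phase: milder underclock, still bypassed.
        return NocConfig(freq_mhz=1000, bypass=True)
    else:
        # NoC-bound phase: full frequency, normal pipelined routing.
        return NocConfig(freq_mhz=2000, bypass=False)


# Example: a compute-bound phase gets the low-power bypassed configuration.
cfg = choose_noc_config(0.05)
print(cfg.freq_mhz, cfg.bypass)
```

The point of coupling bypass with underclocking, as the abstract describes, is that a slower NoC clock alone would hurt NoC-sensitive phases; skipping router pipeline stages and re-timer flops compensates for the longer cycle time during low-sensitivity phases.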