Research article · DOI: 10.1145/3366428.3380767

GPGPU performance estimation for frequency scaling using cross-benchmarking

Published: 23 February 2020

Abstract

Dynamic Voltage and Frequency Scaling (DVFS) on General-Purpose Graphics Processing Units (GPGPUs) has become one of the most important techniques for balancing computational performance and energy consumption. However, fast and accurate models for predicting GPU kernel execution time under different core and memory frequency settings remain scarce, even though such predictions are essential for determining the most energy-efficient frequency configuration. Accordingly, a novel GPGPU performance estimation model covering both core and memory frequency scaling is proposed. We design a cross-benchmarking suite that generates synthetic kernels spanning a wide range of instruction distributions; these kernels can be used for model pre-training or as supplementary training samples. We then apply two machine learning algorithms, Support Vector Regression (SVR) and Gradient Boosting Decision Tree (GBDT), to learn the correlation between kernel performance counters and kernel performance. Models trained only on our cross-benchmarking suite achieve satisfactory accuracy (16%–22% mean absolute error) on 24 unseen real application kernels. Validated on three modern GPUs with wide frequency scaling ranges, using the same collection of 24 real application kernels, the proposed model achieves accurate results (5.1%, 2.8%, and 6.5% mean absolute error) on the target GPUs (GTX 980, Titan X Pascal, and Tesla P100).
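The modeling step the abstract describes — regressing kernel execution time on performance counters plus core/memory frequency settings, using SVR and GBDT — can be sketched as follows. This is a minimal illustration with scikit-learn on synthetic placeholder data, not the paper's feature set or dataset: the four features and the toy timing function below are assumptions for demonstration only.

```python
# Hedged sketch of the paper's modeling idea: learn execution time from an
# instruction-mix profile plus normalized core/memory frequencies, with the
# two learners the paper names (SVR and GBDT). All features and the ground
# truth below are synthetic placeholders, not the paper's profiler data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0.1, 0.9, n),   # placeholder: fraction of compute instructions
    rng.uniform(0.1, 0.9, n),   # placeholder: fraction of memory instructions
    rng.uniform(0.5, 1.5, n),   # normalized core frequency
    rng.uniform(0.5, 1.5, n),   # normalized memory frequency
])
# Toy ground truth: the compute-bound part of the runtime scales with
# 1/f_core, the memory-bound part with 1/f_mem, plus measurement noise.
y = X[:, 0] / X[:, 2] + X[:, 1] / X[:, 3] + rng.normal(0, 0.02, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "SVR": make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.01)),
    "GBDT": GradientBoostingRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mape = np.mean(np.abs(pred - y_te) / np.abs(y_te)) * 100
    print(f"{name}: mean absolute percentage error = {mape:.1f}%")
```

In the paper's setting, the synthetic-kernel suite plays the role of this generated training set, and real profiler counters replace the placeholder features; the learner choice and the error metric follow the abstract.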


Cited By

  • (2025) Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 1118-1132. DOI: 10.1145/3669940.3707231. Online publication date: 30 March 2025.
  • (2024) Improving GPU Energy Efficiency through an Application-transparent Frequency Scaling Policy with Performance Assurance. In Proceedings of the Nineteenth European Conference on Computer Systems, 769-785. DOI: 10.1145/3627703.3629584. Online publication date: 22 April 2024.
  • (2024) DSO: A GPU Energy Efficiency Optimizer by Fusing Dynamic and Static Information. In 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 1-6. DOI: 10.1109/IWQoS61813.2024.10682917. Online publication date: 19 June 2024.
  • (2021) Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on GPUs for Real-Time Edge Computing. IEEE Access 9, 18167-18182. DOI: 10.1109/ACCESS.2021.3053409. Online publication date: 2021.


Published In

GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit
February 2020
77 pages
ISBN:9781450370257
DOI:10.1145/3366428

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. GPU performance modeling
  2. dynamic voltage and frequency scaling
  3. graphics processing units
  4. machine learning

Funding Sources

  • Hong Kong RGC

Conference

PPoPP '20

Acceptance Rates

GPGPU '20 paper acceptance rate: 7 of 12 submissions (58%)
Overall acceptance rate: 57 of 129 submissions (44%)
