
Towards predicting GPGPU performance for concurrent workloads in Multi-GPGPU environment

Published in: Cluster Computing

Abstract

General-purpose graphics processing units (GPGPUs) have been widely adopted in industry because graphics processing units (GPUs) offer far higher parallelism than central processing units (CPUs). In particular, GPGPU devices have been adopted for various scientific workloads that exhibit high parallelism. To handle the ever-increasing demand, multiple applications are often run simultaneously on multiple GPGPU devices. However, when multiple applications run concurrently, the overall performance of GPGPU devices varies significantly because of the different characteristics of GPGPU applications. To improve efficiency, it is critical to anticipate the performance of applications and to find an optimal scheduling policy. In this paper, we analyze various types of scientific applications and identify the factors that impact performance when applications are executed concurrently on GPGPU devices. Our analysis shows that each application has distinct characteristics, and that, when these characteristics are taken into account, certain combinations of applications perform better than others when executed concurrently on multiple GPGPU devices. Based on the findings of our analysis, we propose a simulator that predicts the performance of GPGPU devices when multiple applications run concurrently. The simulator collects performance metrics during the execution of the applications and uses these metrics to predict the performance of given combinations. Our experimental results show that, on a single GPGPU device, the best combination of applications improves performance by 39.44% and 65.98% compared with the average combination and the worst combination, respectively. With multiple GPGPU devices, the performance improvement is 24.78% and 39.32% compared with the average and the worst combinations, respectively.
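To make the prediction idea concrete, below is a minimal sketch (in Python) of how such a simulator could work: profile each application when it runs alone, then estimate the slowdown of a co-located pair from the combined pressure on compute and memory bandwidth. The profile fields, the max-pressure contention model, and the sample numbers are illustrative assumptions for this sketch, not the metrics or model actually used by the paper's simulator.

# Hypothetical sketch: pick the best pair of applications to co-locate on
# one GPGPU device, given per-application solo profiles. The contention
# model below (slowdown = largest oversubscription of compute or memory
# bandwidth) is an assumption for illustration, not the paper's model.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class AppProfile:
    name: str
    solo_runtime: float   # seconds when run alone on one GPGPU device
    sm_util: float        # fraction of streaming multiprocessors used (0..1)
    mem_bw_util: float    # fraction of device memory bandwidth used (0..1)

def predicted_pair_runtime(a: AppProfile, b: AppProfile) -> float:
    """Estimate the makespan of running two applications concurrently.

    Assumption: the resource (compute or memory bandwidth) that is
    oversubscribed the most dictates the slowdown of both applications.
    """
    compute_pressure = a.sm_util + b.sm_util
    memory_pressure = a.mem_bw_util + b.mem_bw_util
    slowdown = max(1.0, compute_pressure, memory_pressure)
    return max(a.solo_runtime, b.solo_runtime) * slowdown

def best_pairing(profiles: list[AppProfile]):
    """Enumerate all pairs and return the one with the lowest predicted makespan."""
    return min(combinations(profiles, 2),
               key=lambda pair: predicted_pair_runtime(*pair))

if __name__ == "__main__":
    # Example profiles (invented numbers): one compute-bound and one
    # memory-bound application complement each other well under this model.
    apps = [
        AppProfile("cfd", 12.0, sm_util=0.9, mem_bw_util=0.3),
        AppProfile("bfs",  8.0, sm_util=0.3, mem_bw_util=0.8),
        AppProfile("lmi", 15.0, sm_util=0.7, mem_bw_util=0.6),
    ]
    a, b = best_pairing(apps)
    print(f"best pair: {a.name}+{b.name}, "
          f"predicted runtime {predicted_pair_runtime(a, b):.1f}s")

Under this toy model, pairing a compute-bound with a memory-bound application is predicted to outperform pairing two applications that stress the same resource, which is the intuition behind the combination-dependent performance gaps reported in the abstract.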




Notes

  1. The concurrent execution of two lmi applications is marked N/A because the number of GPGPU cores is not sufficient to run two lmi applications simultaneously.



Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. 2016M3C4A7952587, 2017R1A2B4004513, and 2018R1C1B5085640), and by the BK21 Plus for Pioneers in Innovative Computing (Dept. of Computer Science and Engineering, SNU) funded by the National Research Foundation of Korea (NRF) (No. 21A20151113068).

Author information


Corresponding author

Correspondence to Hyeonsang Eom.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kim, S., Kim, D., Son, Y. et al. Towards predicting GPGPU performance for concurrent workloads in Multi-GPGPU environment. Cluster Comput 23, 2261–2272 (2020). https://doi.org/10.1007/s10586-020-03105-2

