
Towards predicting GPGPU performance for concurrent workloads in Multi-GPGPU environment

Published in: Cluster Computing

Abstract

General-purpose graphics processing units (GPGPUs) have been widely adopted in industry because graphics processing units (GPUs) offer far higher parallelism than central processing units (CPUs). In particular, GPGPU devices have been adopted for various scientific workloads that exhibit high parallelism. To handle the ever-increasing demand, multiple applications are often run simultaneously on multiple GPGPU devices. However, when multiple applications run concurrently, the overall performance of GPGPU devices varies significantly because of the different characteristics of GPGPU applications. To improve efficiency, it is critical to anticipate the performance of applications and to find an optimal scheduling policy. In this paper, we analyze various types of scientific applications and identify the factors that impact performance when applications are executed concurrently on GPGPU devices. Our analysis shows that each application has distinct characteristics, and that, when these characteristics are taken into account, certain combinations of applications perform better than others when executed concurrently on multiple GPGPU devices. Based on the findings of our analysis, we propose a simulator that predicts the performance of GPGPU devices when multiple applications run concurrently. The simulator collects performance metrics during the execution of the applications and uses these metrics to predict the performance of given combinations. Our experimental results show that, on a single GPGPU device, the best combination of applications improves performance by 39.44% and 65.98% compared with the average combination and the worst combination, respectively. With multiple GPGPU devices, the performance improvement is 24.78% and 39.32% compared with the average and the worst combinations, respectively.
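To make the prediction idea concrete, below is a minimal sketch (in Python) of how such a simulator could work: profile each application when it runs alone, then estimate the slowdown of a co-located pair from the combined pressure on compute and memory bandwidth. The profile fields, the max-pressure contention model, and the sample numbers are illustrative assumptions for this sketch, not the metrics or model actually used by the paper's simulator.

# Hypothetical sketch: pick the best pair of applications to co-locate on
# one GPGPU device, given per-application solo profiles. The contention
# model below (slowdown = largest oversubscription of compute or memory
# bandwidth) is an assumption for illustration, not the paper's model.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class AppProfile:
    name: str
    solo_runtime: float   # seconds when run alone on one GPGPU device
    sm_util: float        # fraction of streaming multiprocessors used (0..1)
    mem_bw_util: float    # fraction of device memory bandwidth used (0..1)

def predicted_pair_runtime(a: AppProfile, b: AppProfile) -> float:
    """Estimate the makespan of running two applications concurrently.

    Assumption: the resource (compute or memory bandwidth) that is
    oversubscribed the most dictates the slowdown of both applications.
    """
    compute_pressure = a.sm_util + b.sm_util
    memory_pressure = a.mem_bw_util + b.mem_bw_util
    slowdown = max(1.0, compute_pressure, memory_pressure)
    return max(a.solo_runtime, b.solo_runtime) * slowdown

def best_pairing(profiles: list[AppProfile]):
    """Enumerate all pairs and return the one with the lowest predicted makespan."""
    return min(combinations(profiles, 2),
               key=lambda pair: predicted_pair_runtime(*pair))

if __name__ == "__main__":
    # Example profiles (invented numbers): one compute-bound and one
    # memory-bound application complement each other well under this model.
    apps = [
        AppProfile("cfd", 12.0, sm_util=0.9, mem_bw_util=0.3),
        AppProfile("bfs",  8.0, sm_util=0.3, mem_bw_util=0.8),
        AppProfile("lmi", 15.0, sm_util=0.7, mem_bw_util=0.6),
    ]
    a, b = best_pairing(apps)
    print(f"best pair: {a.name}+{b.name}, "
          f"predicted runtime {predicted_pair_runtime(a, b):.1f}s")

Under this toy model, pairing a compute-bound with a memory-bound application is predicted to outperform pairing two applications that stress the same resource, which is the intuition behind the combination-dependent performance gaps reported in the abstract.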




Notes

  1. The concurrent execution of two lmi applications is marked N/A because the number of GPGPU cores is not sufficient to run two lmi applications simultaneously.



Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. 2016M3C4A7952587, 2017R1A2B4004513, and 2018R1C1B5085640), and by the BK21 Plus for Pioneers in Innovative Computing (Dept. of Computer Science and Engineering, SNU) funded by the National Research Foundation of Korea (NRF) (No. 21A20151113068).

Author information


Corresponding author

Correspondence to Hyeonsang Eom.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kim, S., Kim, D., Son, Y. et al. Towards predicting GPGPU performance for concurrent workloads in Multi-GPGPU environment. Cluster Comput 23, 2261–2272 (2020). https://doi.org/10.1007/s10586-020-03105-2

