Abstract
General-purpose graphics processing units (GPGPUs) have been widely adopted in industry because graphics processing units (GPUs) offer far higher parallelism than central processing units (CPUs). In particular, GPGPU devices are commonly used for scientific workloads with high degrees of parallelism. To handle the ever-increasing demand, multiple applications are often run simultaneously on multiple GPGPU devices. However, when multiple applications run concurrently, the overall performance of GPGPU devices varies significantly owing to the different characteristics of GPGPU applications. To improve efficiency, it is critical to anticipate application performance and find an optimal scheduling policy. In this paper, we analyze various types of scientific applications and identify the factors that affect performance when applications execute concurrently on GPGPU devices. Our analysis shows that each application has distinct characteristics; consequently, certain combinations of applications perform better than others when executed concurrently on multiple GPGPU devices. Based on these findings, we propose a simulator that predicts the performance of GPGPU devices when multiple applications run concurrently. Our simulator collects performance metrics during application execution and uses them to predict the performance of candidate combinations. The experimental results show that, on a single GPGPU device, the best combination of applications increases performance by 39.44% and 65.98% compared with the average combination and the worst case, respectively. With multiple GPGPU devices, the performance improvement is 24.78% and 39.32% compared with the average and the worst combinations, respectively.
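The co-scheduling idea in the abstract can be illustrated with a minimal sketch. The profile metrics, the additive interference model, and all names below are hypothetical placeholders, not the paper's actual simulator: the point is only the overall shape, i.e., score every way of pairing applications onto GPUs using profiled metrics, then pick the pairing with the lowest predicted slowdown.

```python
from itertools import combinations

# Hypothetical per-application profiles (illustrative values, not from the
# paper): normalized fractions of compute-bound vs. memory-bound activity.
PROFILES = {
    "bfs":     {"compute": 0.3, "mem": 0.7},
    "hotspot": {"compute": 0.8, "mem": 0.2},
    "lud":     {"compute": 0.6, "mem": 0.4},
    "nw":      {"compute": 0.4, "mem": 0.6},
}

def slowdown(a, b):
    """Toy interference model: co-scheduled applications contend more when
    they stress the same resource, so predicted slowdown grows with the
    overlap of their compute and memory demands."""
    pa, pb = PROFILES[a], PROFILES[b]
    return 1.0 + pa["compute"] * pb["compute"] + pa["mem"] * pb["mem"]

def pairings(apps):
    """Yield every partition of the application list into unordered pairs
    (one pair per GPU)."""
    if not apps:
        yield []
        return
    first, rest = apps[0], apps[1:]
    for i, partner in enumerate(rest):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + tail

def best_pairing(apps):
    """Exhaustively score all pairings and return the one with the lowest
    total predicted slowdown."""
    best, best_cost = None, float("inf")
    for pairing in pairings(apps):
        cost = sum(slowdown(a, b) for a, b in pairing)
        if cost < best_cost:
            best, best_cost = pairing, cost
    return best, best_cost

pairing, cost = best_pairing(["bfs", "hotspot", "lud", "nw"])
print(pairing, round(cost, 2))
```

Under this toy model, pairing a memory-bound application with a compute-bound one minimizes contention, which mirrors the abstract's observation that some combinations outperform others; an exhaustive search like this only scales to small numbers of applications.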
Notes
The concurrent execution of two lmi applications is marked N/A because the number of GPGPU cores is insufficient to run two lmi applications simultaneously.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. 2016M3C4A7952587, 2017R1A2B4004513, and 2018R1C1B5085640), and by the BK21 Plus for Pioneers in Innovative Computing (Dept. of Computer Science and Engineering, SNU) funded by the National Research Foundation of Korea (NRF) (No. 21A20151113068).
Cite this article
Kim, S., Kim, D., Son, Y. et al. Towards predicting GPGPU performance for concurrent workloads in Multi-GPGPU environment. Cluster Comput 23, 2261–2272 (2020). https://doi.org/10.1007/s10586-020-03105-2