KubeGPU: efficient sharing and isolation mechanisms for GPU resource management in container cloud

The Journal of Supercomputing

Abstract

With a growing number of containerized applications, such as high-performance computing and deep learning workloads, starting to rely on GPUs, efficient GPU support in container clouds has become essential. While GPU sharing has been studied extensively for virtual machines, little work addresses containers. Existing systems deploy containers with a single, fixed GPU virtualization technique, such as GPU pass-through or API forwarding, and do not optimize remote GPU virtualization. Because container resource requirements and GPU virtualization techniques are dynamic and heterogeneous, and because of communication overhead and resource racing, these limitations lead to low system throughput and degraded container performance. We therefore designed and implemented KubeGPU, which extends Kubernetes to enable GPU sharing with an adaptive sharing strategy. This strategy lets KubeGPU dynamically choose a GPU virtualization technique when deploying each container, based on the available GPU resources and the container's configuration parameters (such as its GPU resource requirements), to achieve good container performance and system throughput. In addition, a network-aware scheduling approach and fine-grained allocation of remote GPU resources are proposed to optimize remote GPU virtualization. Finally, using representative real-world HPC and deep learning workloads, we demonstrate the superiority of KubeGPU over existing systems and its effectiveness in minimizing communication overhead and eliminating remote GPU resource racing.
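
The paper's actual decision logic is not reproduced on this page, but the idea behind the adaptive sharing strategy can be sketched. The Go code below is a minimal, illustrative sketch only: the names (VirtMode, GPUState, chooseVirtMode), the three-way choice between pass-through, local API forwarding, and remote GPU virtualization, and the memory-based fit test are all assumptions for illustration, not KubeGPU's implementation.

    package main

    import "fmt"

    type VirtMode int

    const (
        PassThrough  VirtMode = iota // whole local GPU: best performance
        LocalShared                  // API forwarding on a local GPU
        RemoteShared                 // remote GPU virtualization over the network
    )

    type GPUState struct {
        Local      bool // true if the GPU sits on the scheduling node
        FreeMemory int  // MiB still unallocated on this GPU
        Idle       bool // no container scheduled on it (see Note 3 below)
    }

    // chooseVirtMode picks a virtualization technique for one container
    // from its GPU memory request and the cluster's GPU state, preferring
    // local resources to avoid remote communication overhead.
    func chooseVirtMode(requestMiB int, gpus []GPUState) (VirtMode, bool) {
        for _, g := range gpus { // an idle local GPU can be passed through whole
            if g.Local && g.Idle && g.FreeMemory >= requestMiB {
                return PassThrough, true
            }
        }
        for _, g := range gpus { // otherwise share a busy local GPU
            if g.Local && g.FreeMemory >= requestMiB {
                return LocalShared, true
            }
        }
        for _, g := range gpus { // last resort: a remote GPU over the network
            if !g.Local && g.FreeMemory >= requestMiB {
                return RemoteShared, true
            }
        }
        return 0, false // no fit: leave the container pending
    }

    func main() {
        gpus := []GPUState{
            {Local: true, FreeMemory: 2048},
            {Local: false, FreeMemory: 16384, Idle: true},
        }
        mode, ok := chooseVirtMode(4096, gpus)
        fmt.Println(mode, ok) // prints "2 true", i.e. RemoteShared
    }

In this toy example, a container requesting 4096 MiB falls through to the remote GPU because the only local GPU cannot satisfy the request, mirroring the dynamic choice the abstract describes.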


Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

  1. The details of network mode and network device are described in Sect. 2.3.

  2. Transparency means that KubeGPU can be integrated seamlessly with other Kubernetes components to optimize remote GPU virtualization: instead of replacing the installed network plugin, KubeGPU allocates to containers whichever available network mode offers the better network performance.

  3. An idle GPU Device Resource is a GPU with no container scheduled on it (a minimal filtering sketch follows these notes).

  4. KubeShare is incompatible with GROMACS and therefore introduces more performance overhead than GaiaGPU; similarly, GaiaGPU is incompatible with PyTorch MNIST. Here, incompatibility means that the resource allocation strategy of the GPU management framework cannot meet the resource demand of the container application.
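
As a minimal illustration of the definition in Note 3, the Go sketch below filters a device list down to its idle members. The Device type and its ScheduledContainers field are hypothetical stand-ins for KubeGPU's internal bookkeeping, not the paper's data structures.

    package main

    import "fmt"

    // Device stands in for one GPU Device Resource.
    type Device struct {
        ID                  string
        ScheduledContainers []string // IDs of containers placed on this GPU
    }

    // idleDevices keeps only the GPUs with no container scheduled on them,
    // i.e. the devices Note 3 calls idle.
    func idleDevices(devices []Device) []Device {
        var idle []Device
        for _, d := range devices {
            if len(d.ScheduledContainers) == 0 {
                idle = append(idle, d)
            }
        }
        return idle
    }

    func main() {
        devs := []Device{
            {ID: "gpu-0"},
            {ID: "gpu-1", ScheduledContainers: []string{"c-42"}},
        }
        fmt.Println(idleDevices(devs)) // [{gpu-0 []}]
    }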

References

  1. Al Jawarneh IM, Bellavista P, Bosi F, Foschini L, Martuscelli G, Montanari R, Palopoli A (2019) Container orchestration engines: a thorough functional and performance comparison. In: ICC 2019-2019 IEEE International Conference on Communications (ICC), pp 1–6. IEEE

  2. Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz RH, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, 11:22–22

  3. Red Hat OpenShift makes container orchestration easier (2021) https://www.redhat.com/en/technologies/cloud-computing/openshift. Accessed 29 Mar

  4. Swarm mode overview (2021) https://docs.docker.com/engine/swarm/. Accessed 29 Mar

  5. Kubernetes (2021) https://github.com/kubernetes/kubernetes. Accessed 18 Nov

  6. Altintas I, Marcus K, Nealey I, Sellars SL, Graham J, Mishin D, Polizzi J, Crawl D, DeFanti T, Smarr L (2019) Workflow-driven distributed machine learning in CHASE-CI: a cognitive hardware and software ecosystem community infrastructure. In: 2019 IEEE international parallel and distributed processing symposium workshops (IPDPSW), pp 865–873. IEEE

  7. Managing Resources for Containers (2021) https://kubernetes.io/docs/concepts/configuration/manage-resources-containers. Accessed 22 Oct

  8. Yoon DH, Han Y (2020) Parallel power flow computation trends and applications: a review focusing on GPU. Energies 13(9):2147

  9. Hong CH, Spence I, Nikolopoulos DS (2017) GPU virtualization and scheduling methods: a comprehensive survey. ACM Comput Surv (CSUR) 50(3):1–37

  10. Naranjo DM, Risco S, de Alfonso C, Pérez A, Blanquer I, Moltó G (2020) Accelerated serverless computing based on GPU virtualization. J Parallel Distrib Comput 139:32–42

  11. Silla F, Prades J, Iserte S, Reano C (2016) Remote GPU virtualization: is it useful? In: 2016 2nd IEEE international workshop on high-performance interconnection networks in the exascale and big-data era (HiPINEB), pp 41–48. IEEE

  12. Thinakaran P, Gunasekaran JR, Sharma B, Kandemir MT, Das CR (2019) Kube-Knots: resource harvesting through dynamic container orchestration in GPU-based datacenters. In: 2019 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, pp 1–13

  13. Lu Q, Yao J, Guan H, Gao P (2019) gQoS: a QoS-oriented GPU virtualization with adaptive capacity sharing. IEEE Trans Parallel Distrib Syst 31(4):843–855

  14. Gonzalez NM, Elengikal T (2021) Transparent I/O-aware GPU virtualization for efficient resource consolidation. In: 2021 IEEE international parallel and distributed processing symposium (IPDPS), pp 131–140. IEEE

  15. Tang D, Li L, Ma J, Liu X, Qi Z, Guan H (2021) gRemote: cloud rendering on GPU resource pool based on API-forwarding. J Syst Archit 116:102055

  16. Song S, Deng L, Gong J, Luo H (2018) Gaia scheduler: a Kubernetes-based scheduler framework. In: 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). IEEE, pp 252–259

  17. cGPU overview (2021) https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/cgpu-overview. Accessed 22 Sep

  18. k8s-device-plugin (2021) https://github.com/NVIDIA/k8s-device-plugin. Accessed 13 Nov

  19. Reaño C, Silla F, Shainer G, Schultz S (2015) Local and remote GPUs perform similar with EDR 100G InfiniBand. In: Proceedings of the industrial track of the 16th international middleware conference, pp 1–7

  20. Reaño C, Silla F (2016) Reducing the performance gap of remote GPU virtualization with InfiniBand Connect-IB. In: 2016 IEEE symposium on computers and communication (ISCC), pp 920–925. IEEE

  21. Reaño C, Silla F (2017) A comparative performance analysis of remote GPU virtualization over three generations of GPUs. In: 2017 46th international conference on parallel processing workshops (ICPPW), pp 121–128. IEEE

  22. Qi S, Kulkarni SG, Ramakrishnan K (2020) Understanding container network interface plugins: design considerations and performance. In: 2020 IEEE international symposium on local and metropolitan area networks (LANMAN), pp 1–6. IEEE

  23. Xu C, Rajamani K, Felter W (2018) NBWGuard: realizing network QoS for Kubernetes. In: Proceedings of the 19th international middleware conference industry, pp 32–38

  24. Deepomatic (2021) https://github.com/Deepomatic/shared-gpu-nvidia-k8s-device-plugin. Accessed 12 Oct

  25. gpushare-scheduler-extender (2021) https://github.com/AliyunContainerService/gpushare-scheduler-extender. Accessed 4 Nov

  26. Kang D, Jun TJ, Kim D, Kim J, Kim D (2017) ConVGPU: GPU management middleware in container-based virtualized environment. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, pp 301–309

  27. Yeh TA, Chen HH, Chou J (2020) KubeShare: a framework to manage GPUs as first-class and shared resources in container cloud. In: Proceedings of the 29th international symposium on high-performance parallel and distributed computing, pp 173–184

  28. Chiang MC, Chou J (2021) DynamoML: dynamic resource management operators for machine learning workloads. In: CLOSER, pp 122–132

  29. Satzke K, Akkus IE, Chen R, Rimac I, Stein M, Beck A, Aditya P, Vanga M, Hilt V (2020) Efficient GPU sharing for serverless workflows. In: Proceedings of the 1st workshop on high performance serverless computing, pp 17–24

  30. Vinoski S (2002) Chain of responsibility. IEEE Internet Comput 6(6):80–83

  31. Single Root I/O Virtualization and Sharing Specification Revision 1.1 (2010) https://members.pcisig.com/wg/PCI-SIG/document/download/8238. Accessed 20 Jan

  32. Kang J, Lim J, Yu H (2020) Partial migration technique for GPGPU tasks to prevent GPU memory starvation in RPC-based GPU virtualization. Softw Pract Exp 50(6):948–972

  33. Xiao S, Balaji P, Zhu Q, Thakur R, Coghlan S, Lin H, Wen G, Hong J, Feng W-c (2012) VOCL: an optimized environment for transparent virtualization of graphics processing units. In: 2012 innovative parallel computing (InPar), pp 1–12. IEEE

  34. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1:19–25

  35. Paszke A, Gross S, Massa F et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp 8024–8035

  36. Vouzis PD, Sahinidis NV (2011) GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics 27(2):182–188

  37. Anderson JA, Glaser J, Glotzer SC (2020) HOOMD-blue: a Python package for high-performance molecular dynamics and hard particle Monte Carlo simulations. Comput Mater Sci 173:109363

  38. Liu Z, Chen C, Li J, Cheng Y, Kou Y, Zhang D (2022) KubFBS: a fine-grained and balance-aware scheduling system for deep learning tasks based on Kubernetes. Concurr Comput Pract Exp 34(11):6836

  39. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M et al (2016) TensorFlow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp 265–283

  40. Reaño C, Prades J, Silla F (2018) Exploring the use of remote GPU virtualization in low-power systems for bioinformatics applications. In: Proceedings of the 47th international conference on parallel processing companion, pp 1–8

Acknowledgements

This work was supported by the Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University (Grant No. 19DZ2252600), the National Key Research and Development Program of China (No. 2018YFB0704400), the Key Program of Science and Technology of Yunnan Province (No. 202002AB080001-2), the Grand Joint Projects of Shanghai University (Grant No. 202124), and GHfund B (Grant No. 20210702).

Author information

Corresponding author

Correspondence to Zhengsen Liu.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shen, W., Liu, Z., Tan, Y. et al. KubeGPU: efficient sharing and isolation mechanisms for GPU resource management in container cloud. J Supercomput 79, 591–625 (2023). https://doi.org/10.1007/s11227-022-04682-2
