Abstract
With the development of deep learning, hardware accelerators represented by GPUs have been used to accelerate the execution of deep learning applications. A key problem in GPU clusters is how to schedule diverse deep learning applications, including training applications and latency-critical inference applications, to achieve optimal system performance. In cloud datacenters, inference applications often require few resources, so executing a single inference application exclusively on a GPU wastes a significant share of that GPU's capacity. Existing work mainly focuses on co-locating multiple inference applications in datacenters using MPS (Multi-Process Service). This execution pattern suffers from several problems: datacenters may remain in a low-workload state for long periods due to the diurnal pattern of inference traffic, MPS-based data sharing can lead to interaction errors between contexts, and resource contention may cause Quality of Service (QoS) violations. To address these problems, we propose ArkGPU, a runtime system that dynamically allocates resources. ArkGPU improves the resource utilization of the cluster while guaranteeing the QoS of inference applications. ArkGPU comprises a performance predictor, a scheduler, a resource limiter, and an adjustment unit. We conduct extensive experiments on the NVIDIA V100 GPU to verify the effectiveness of ArkGPU. ArkGPU achieves high goodput for latency-critical applications, with an average throughput increase of 584.27% compared to MPS. When multiple applications are deployed simultaneously on ArkGPU, goodput improves by 94.98% compared to k8s-native and by 38.65% compared to MPS.
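The co-location decision the abstract describes, using a performance predictor to admit best-effort work only while the latency-critical job's QoS holds, can be illustrated with a minimal sketch. All names, the linear slowdown model, and the `compute_demand` unit are hypothetical stand-ins, not ArkGPU's actual predictor or scheduler:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    compute_demand: float  # fraction of GPU compute requested (hypothetical unit)

def predict_latency(solo_latency_ms: float, contention: float) -> float:
    # Hypothetical stand-in for the performance predictor: a simple
    # linear slowdown model; a real predictor would be learned offline.
    return solo_latency_ms * (1.0 + contention)

def admit(lc_solo_ms: float, qos_target_ms: float, candidates: list[Job]) -> list[Job]:
    """Greedily admit best-effort jobs while the predicted latency of the
    latency-critical job stays within its QoS target."""
    admitted, contention = [], 0.0
    for job in sorted(candidates, key=lambda j: j.compute_demand):
        if predict_latency(lc_solo_ms, contention + job.compute_demand) <= qos_target_ms:
            admitted.append(job)
            contention += job.compute_demand
    return admitted

jobs = [Job("train-a", 0.4), Job("train-b", 0.2), Job("train-c", 0.5)]
print([j.name for j in admit(lc_solo_ms=10.0, qos_target_ms=18.0, candidates=jobs)])
# → ['train-b', 'train-a']  (train-c would push predicted latency past the target)
```

The greedy smallest-demand-first order is one simple policy; the paper's scheduler additionally relies on a resource limiter and an adjustment unit to react when the prediction errs at runtime.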
Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported by the National Key Research and Development Program (Grant No. 2022YFB4501404), the Beijing Natural Science Foundation (Grant No. 4232036), and the CAS Project for Youth Innovation Promotion Association.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lou, J., Sun, Y., Zhang, J. et al. ArkGPU: enabling applications’ high-goodput co-location execution on multitasking GPUs. CCF Trans. HPC 5, 304–321 (2023). https://doi.org/10.1007/s42514-023-00154-y