
Job placement using reinforcement learning in GPU virtualization environment


Abstract

Graphics processing units (GPUs) are widely used to accelerate workloads in computational science fields such as biology, chemistry, and meteorology, and in machine learning areas such as image and video analysis. Recently, data centers and cloud providers have adopted GPUs and offer them as computing resources. Because most cloud providers allocate GPUs to users exclusively, the allocated GPU resources may be underutilized. Although sharing a GPU among multiple users can increase resource utilization, individual jobs may suffer performance degradation because of interference between co-located jobs. It is difficult for a cloud provider to heuristically predict or control the performance of diverse applications running on diverse cloud resources. Therefore, an intelligent job placement technique is required that minimizes interference between jobs and increases resource utilization. This study defines the resource utilization history of applications and proposes a reinforcement learning-based job placement technique that takes this history as input. A deep reinforcement learning model (DQN) is used to learn from the resource utilization history. The trained model predicts which jobs will least affect overall performance when executed simultaneously, and places jobs so that the capacity of the current resources is not exceeded. This approach prevents performance degradation for applications with diverse execution characteristics and increases resource utilization by executing applications on shared resources. The proposed method is evaluated against other placement methods on workloads with various resource utilization characteristics. The experiments show that the proposed method reduces total execution time and uses resources effectively while maintaining application performance.
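To make the idea concrete, the following is a minimal sketch (in PyTorch) of the kind of DQN placement agent the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the state is assumed to concatenate per-GPU free capacity with the waiting job's utilization history, the action is the GPU on which to co-locate the job, and the names (QNet, select_gpu, train_step, N_GPUS, HISTORY_LEN) and network shape are hypothetical. A separate target network, used in standard DQN, is omitted for brevity.

import random
from collections import deque

import torch
import torch.nn as nn

N_GPUS = 4          # assumed number of shareable GPUs (hypothetical)
HISTORY_LEN = 8     # assumed length of the job's utilization-history window

class QNet(nn.Module):
    """Maps (per-GPU state, job utilization history) to one Q-value per GPU."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_GPUS + HISTORY_LEN, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_GPUS),  # Q-value for placing the job on each GPU
        )

    def forward(self, x):
        return self.net(x)

qnet = QNet()
optimizer = torch.optim.Adam(qnet.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # replay buffer of (s, a, r, s') transitions
GAMMA, EPSILON = 0.99, 0.1

def select_gpu(state):
    """Epsilon-greedy placement: usually the GPU with the highest Q-value."""
    if random.random() < EPSILON:
        return random.randrange(N_GPUS)
    with torch.no_grad():
        return int(qnet(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(batch_size=32):
    """One DQN update over a random minibatch of stored transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                      # bootstrap target (no target net)
        target = r + GAMMA * qnet(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In such a setup, the reward recorded after each placement could be, for example, the negative of the measured slowdown of the co-located jobs, so that over time the agent learns to avoid interference-heavy pairings while still packing jobs onto shared GPUs.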





Acknowledgements

The authors would like to thank all students who contributed to this study. We are grateful to Qichen Chen and Sejin Kim, who assisted with the evaluation. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. NRF-2015M3C4A7065646 and 2017R1A2B4005681).

Author information


Corresponding author

Correspondence to Yoonhee Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Oh, J., Kim, Y. Job placement using reinforcement learning in GPU virtualization environment. Cluster Comput 23, 2219–2234 (2020). https://doi.org/10.1007/s10586-019-03044-7

