
Job placement using reinforcement learning in GPU virtualization environment


Abstract

Graphics processing units (GPUs) are widely used to accelerate workloads in computational science fields such as biology, chemistry, and meteorology, and in machine learning areas such as image and video analysis. Recently, data centers and cloud providers have adopted GPUs and offer them as computing resources. Because most cloud providers allocate GPUs to users exclusively, the allocated GPU resources may be underutilized. Although sharing a GPU among multiple users can increase resource utilization, individual jobs may suffer performance degradation because of interference between co-located jobs. It is difficult for a cloud provider to heuristically predict or control the performance of diverse applications running on diverse cloud resources. Therefore, an intelligent job placement technique is required that minimizes interference between jobs and increases resource utilization. This study defines the resource utilization history of applications and proposes a reinforcement learning-based job placement technique that takes this history as input. A deep reinforcement learning model (DQN) is used to learn from the resource utilization history. The trained model predicts which jobs will least affect overall performance when executed simultaneously, and places jobs so that the capacity of the current resources is not exceeded. This approach prevents performance degradation for applications with diverse execution characteristics and increases resource utilization by executing applications on shared resources. The proposed method is evaluated against other placement methods on workloads with various resource utilization characteristics. The experiments show that the proposed method reduces total execution time and uses resources effectively while maintaining application performance.
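To make the idea concrete, the following is a minimal sketch (in PyTorch) of the kind of DQN placement agent the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the state is assumed to concatenate per-GPU free capacity with the waiting job's utilization history, the action is the GPU on which to co-locate the job, and the names (QNet, select_gpu, train_step, N_GPUS, HISTORY_LEN) and network shape are hypothetical. A separate target network, used in standard DQN, is omitted for brevity.

import random
from collections import deque

import torch
import torch.nn as nn

N_GPUS = 4          # assumed number of shareable GPUs (hypothetical)
HISTORY_LEN = 8     # assumed length of the job's utilization-history window

class QNet(nn.Module):
    """Maps (per-GPU state, job utilization history) to one Q-value per GPU."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_GPUS + HISTORY_LEN, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_GPUS),  # Q-value for placing the job on each GPU
        )

    def forward(self, x):
        return self.net(x)

qnet = QNet()
optimizer = torch.optim.Adam(qnet.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # replay buffer of (s, a, r, s') transitions
GAMMA, EPSILON = 0.99, 0.1

def select_gpu(state):
    """Epsilon-greedy placement: usually the GPU with the highest Q-value."""
    if random.random() < EPSILON:
        return random.randrange(N_GPUS)
    with torch.no_grad():
        return int(qnet(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(batch_size=32):
    """One DQN update over a random minibatch of stored transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                      # bootstrap target (no target net)
        target = r + GAMMA * qnet(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In such a setup, the reward recorded after each placement could be, for example, the negative of the measured slowdown of the co-located jobs, so that over time the agent learns to avoid interference-heavy pairings while still packing jobs onto shared GPUs.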





Acknowledgements

The authors would like to thank all students who contributed to this study. We are grateful to Qichen Chen and Sejin Kim, who assisted with the evaluation. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. NRF-2015M3C4A7065646 and 2017R1A2B4005681).

Author information


Corresponding author

Correspondence to Yoonhee Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Oh, J., Kim, Y. Job placement using reinforcement learning in GPU virtualization environment. Cluster Comput 23, 2219–2234 (2020). https://doi.org/10.1007/s10586-019-03044-7

