Abstract
Cloud-native orchestration based on Kubernetes is widely adopted because it allows large-scale resource pools to be built and flexibly expanded by adding worker nodes. To meet the emerging demand for AI (Artificial Intelligence)-inspired HPC (High Performance Computing)/HPDA (High Performance Data Analytics) workloads, versatile AI clusters driven by the open-source KubeFlow software have been rapidly developed by leveraging various ML (Machine Learning)/DL (Deep Learning) tools and frameworks. However, since the current version of KubeFlow is not fully aware of the underlying GPU (Graphics Processing Unit) resources, special attention must be paid to executing ML/DL workloads smoothly. Thus, in this paper, we explore tentative options for improving the ML/DL workflow on a KubeFlow-enabled AI cluster, focusing on GPU utilization efficiency with the assistance of open-source Prometheus monitoring.
Acknowledgments
This work was supported by a 2019 GIST Research Institute (GRI) grant funded by GIST.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Hong, Y., Kim, J. (2020). Workflow Improvement for KubeFlow DL Performance over Cloud-Native SmartX AI Cluster. In: Barolli, L., Okada, Y., Amato, F. (eds) Advances in Internet, Data and Web Technologies. EIDWT 2020. Lecture Notes on Data Engineering and Communications Technologies, vol 47. Springer, Cham. https://doi.org/10.1007/978-3-030-39746-3_53
DOI: https://doi.org/10.1007/978-3-030-39746-3_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39745-6
Online ISBN: 978-3-030-39746-3
eBook Packages: Intelligent Technologies and Robotics (R0)