Workflow Improvement for KubeFlow DL Performance over Cloud-Native SmartX AI Cluster

Hong, Yujin; Kim, JongWon

doi:10.1007/978-3-030-39746-3_53


Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 47)

Included in the following conference series:

International Conference on Emerging Internetworking, Data & Web Technologies

Abstract

Cloud-native, Kubernetes-based orchestration is widely adopted because it allows large-scale resource pools to be built and flexibly expanded by adding worker nodes. To meet the emerging demand for AI (Artificial Intelligence)-inspired HPC (High Performance Computing)/HPDA (High Performance Data Analytics) workloads, versatile AI clusters driven by the open-source KubeFlow software have been rapidly developed by leveraging various ML (Machine Learning)/DL (Deep Learning) tools and frameworks. However, since the current version of KubeFlow is not fully aware of the underlying GPU (Graphics Processing Unit) resources, special attention is needed to execute ML/DL workloads smoothly. Thus, in this paper, we explore tentative options for improving the ML/DL workflow in a KubeFlow-enabled AI cluster, focusing on GPU utilization efficiency with the assistance of Prometheus open-source monitoring.

Acknowledgments

This work was supported by a 2019 GIST Research Institute (GRI) grant funded by GIST.

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, Republic of Korea
Yujin Hong & JongWon Kim

Corresponding author

Correspondence to JongWon Kim.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Fukuoka Institute of Technology, Fukuoka, Japan
Leonard Barolli
Innovation Center for Educational Resources, University Library, Kyushu University, Fukuoka, Japan
Yoshihiro Okada
Department of Electrical Engineering and Information Technology, University of Naples "Federico II", Naples, Italy
Flora Amato

About this paper

Cite this paper

Hong, Y., Kim, J. (2020). Workflow Improvement for KubeFlow DL Performance over Cloud-Native SmartX AI Cluster. In: Barolli, L., Okada, Y., Amato, F. (eds) Advances in Internet, Data and Web Technologies. EIDWT 2020. Lecture Notes on Data Engineering and Communications Technologies, vol 47. Springer, Cham. https://doi.org/10.1007/978-3-030-39746-3_53

DOI: https://doi.org/10.1007/978-3-030-39746-3_53
Published: 31 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39745-6
Online ISBN: 978-3-030-39746-3
eBook Packages: Intelligent Technologies and Robotics (R0)
