Abstract

High performance computing (HPC) and cloud technologies are increasingly coupled to accelerate the convergence of traditional HPC with new simulation, data analysis, machine-learning, and artificial intelligence approaches. While the HPC+cloud paradigm, or converged computing, is ushering in new scientific discoveries with unprecedented levels of workflow automation, several key mismatches between HPC and cloud technologies still preclude this paradigm from realizing its full potential. In this paper, we present a joint effort between IBM Research, Lawrence Livermore National Laboratory (LLNL), and Red Hat to address these mismatches and to bring full HPC scheduling awareness into Kubernetes, the de facto container orchestrator for cloud-native applications, which is increasingly adopted as a key converged-computing enabler. We found Kubernetes lacking interfaces to enable the full spectrum of converged-computing use cases in three areas: (A) an interface to enable HPC batch-job scheduling (e.g., locality-aware node selection), (B) an interface to enable HPC workload- or task-level scheduling, and (C) a resource co-management interface to allow HPC resource managers and Kubernetes to co-manage a resource set. We detail our methodology and present our results, whereby the advanced graph-based scheduler Fluxion, part of the open-source Flux scheduling framework, is integrated as a Kubernetes scheduler plug-in, KubeFlux. Our initial performance study shows that KubeFlux performs comparably (within measurement precision) to the default scheduler, despite KubeFlux's considerably more sophisticated scheduling capabilities.
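The abstract's plug-in approach builds on Kubernetes' standard support for running multiple schedulers side by side: a pod can opt into a secondary scheduler through the `spec.schedulerName` field, while pods that omit the field continue to be placed by the default scheduler. As a minimal sketch, the following pod spec routes an HPC task to a KubeFlux-style scheduler; the scheduler name `kubeflux` and the container image are assumptions for illustration, since the actual names depend on how the deployment registers the scheduler.

```yaml
# Sketch: a pod that opts into a secondary scheduler via spec.schedulerName.
# "kubeflux" is an assumed registration name; the image is hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: hpc-task
spec:
  schedulerName: kubeflux   # route this pod to the secondary scheduler
  containers:
    - name: app
      image: registry.example.com/hpc-app:latest   # hypothetical image
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
```

The scheduling-framework plug-in route the paper takes goes further than simply running a second scheduler binary: it embeds Fluxion's graph-based placement logic into a scheduler built from Kubernetes' own framework extension points, so the pod-selection mechanism above stays unchanged while the placement decisions gain HPC awareness.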


Notes

  1. While we use “RJMS” and “scheduler” interchangeably in this paper, a scheduler is one component of an RJMS.


Acknowledgements

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-823344).


Corresponding author

Correspondence to Dong H. Ahn.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Misale, C. et al. (2022). Towards Standard Kubernetes Scheduling Interfaces for Converged Computing. In: Nichols, J., et al. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation. SMC 2021. Communications in Computer and Information Science, vol 1512. Springer, Cham. https://doi.org/10.1007/978-3-030-96498-6_18

  • DOI: https://doi.org/10.1007/978-3-030-96498-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96497-9

  • Online ISBN: 978-3-030-96498-6

  • eBook Packages: Computer Science, Computer Science (R0)
