Abstract
High-performance computing (HPC) and cloud technologies are increasingly coupled to accelerate the convergence of traditional HPC with new simulation, data analysis, machine-learning, and artificial intelligence approaches. While the HPC+cloud paradigm, or converged computing, is ushering in new scientific discoveries with unprecedented levels of workflow automation, several key mismatches between HPC and cloud technologies still preclude this paradigm from realizing its full potential. In this paper, we present a joint effort between IBM Research, Lawrence Livermore National Laboratory (LLNL), and Red Hat to address these mismatches and to bring full HPC scheduling awareness into Kubernetes, the de facto container orchestrator for cloud-native applications, which is increasingly being adopted as a key converged-computing enabler. We found that Kubernetes lacks the interfaces needed to enable the full spectrum of converged-computing use cases in three areas: (A) an interface to enable HPC batch-job scheduling (e.g., locality-aware node selection), (B) an interface to enable HPC workloads or task-level scheduling, and (C) a resource co-management interface to allow HPC resource managers and Kubernetes to co-manage a resource set. We detail our methodology and present our results, in which the advanced graph-based scheduler Fluxion, part of the open-source Flux scheduling framework, is integrated as a Kubernetes scheduler plug-in, KubeFlux. Our initial performance study shows that KubeFlux performs similarly to the default scheduler (within measurement precision), despite KubeFlux’s considerably more sophisticated scheduling capabilities.
Notes
1. While we use “RJMS” and “scheduler” interchangeably in the paper, a scheduler is one component of an RJMS.
Acknowledgements
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-823344).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Misale, C. et al. (2022). Towards Standard Kubernetes Scheduling Interfaces for Converged Computing. In: Nichols, J., et al. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation. SMC 2021. Communications in Computer and Information Science, vol 1512. Springer, Cham. https://doi.org/10.1007/978-3-030-96498-6_18
DOI: https://doi.org/10.1007/978-3-030-96498-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96497-9
Online ISBN: 978-3-030-96498-6
eBook Packages: Computer Science, Computer Science (R0)