Abstract

High performance computing (HPC) and cloud technologies are increasingly coupled to accelerate the convergence of traditional HPC with new simulation, data analysis, machine-learning, and artificial intelligence approaches. While the HPC+cloud paradigm, or converged computing, is ushering in new scientific discoveries with unprecedented levels of workflow automation, several key mismatches between HPC and cloud technologies still preclude this paradigm from realizing its full potential. In this paper, we present a joint effort between IBM Research, Lawrence Livermore National Laboratory (LLNL), and Red Hat to address these mismatches and to bring full HPC scheduling awareness into Kubernetes, the de facto container orchestrator for cloud-native applications, which is increasingly adopted as a key converged-computing enabler. We found Kubernetes lacking interfaces to enable the full spectrum of converged-computing use cases in three areas: (A) an interface to enable HPC batch-job scheduling (e.g., locality-aware node selection), (B) an interface to enable HPC workload- or task-level scheduling, and (C) a resource co-management interface to allow HPC resource managers and Kubernetes to co-manage a resource set. We detail our methodology and present our results, whereby the advanced graph-based scheduler Fluxion, part of the open-source Flux scheduling framework, is integrated as a Kubernetes scheduler plug-in, KubeFlux. Our initial performance study shows that KubeFlux performs comparably (within measurement precision) to the default scheduler, despite KubeFlux's considerably more sophisticated scheduling capabilities.
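The abstract's plug-in approach builds on Kubernetes' standard support for running multiple schedulers side by side: a pod can opt into a secondary scheduler through the `spec.schedulerName` field, while pods that omit the field continue to be placed by the default scheduler. As a minimal sketch, the following pod spec routes an HPC task to a KubeFlux-style scheduler; the scheduler name `kubeflux` and the container image are assumptions for illustration, since the actual names depend on how the deployment registers the scheduler.

```yaml
# Sketch: a pod that opts into a secondary scheduler via spec.schedulerName.
# "kubeflux" is an assumed registration name; the image is hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: hpc-task
spec:
  schedulerName: kubeflux   # route this pod to the secondary scheduler
  containers:
    - name: app
      image: registry.example.com/hpc-app:latest   # hypothetical image
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
```

The scheduling-framework plug-in route the paper takes goes further than simply running a second scheduler binary: it embeds Fluxion's graph-based placement logic into a scheduler built from Kubernetes' own framework extension points, so the pod-selection mechanism above stays unchanged while the placement decisions gain HPC awareness.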


Notes

  1. While we use “RJMS” and “scheduler” interchangeably in this paper, a scheduler is one component of an RJMS.


Acknowledgements

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-823344).


Corresponding author

Correspondence to Dong H. Ahn.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Misale, C. et al. (2022). Towards Standard Kubernetes Scheduling Interfaces for Converged Computing. In: Nichols, J., et al. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation. SMC 2021. Communications in Computer and Information Science, vol 1512. Springer, Cham. https://doi.org/10.1007/978-3-030-96498-6_18

  • DOI: https://doi.org/10.1007/978-3-030-96498-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96497-9

  • Online ISBN: 978-3-030-96498-6

  • eBook Packages: Computer Science, Computer Science (R0)
