
Cloud-Bursting and Autoscaling for Python-Native Scientific Workflows Using Ray

Conference paper, published in High Performance Computing (ISC High Performance 2023)

Abstract

We have extended the Ray framework to enable automatic scaling of workloads on high-performance computing (HPC) clusters managed by SLURM and bursting to a Cloud managed by Kubernetes. Compared to existing HPC-Cloud convergence solutions, our framework offers several advantages: users can provide their own Cloud resources; the framework provides a Python-level abstraction that does not require users to interact with job submission systems; and a single Python-based parallel workload can run concurrently across an HPC cluster and a Cloud. Applications in Electronic Design Automation are used to demonstrate the functionality of this solution in scaling a workload on an on-premises HPC system and automatically bursting to a public Cloud when the allocated HPC resources are exhausted. The paper focuses on describing the initial implementation, demonstrating the novel functionality of the proposed framework, and identifying practical considerations and limitations of the Cloud-bursting mode. The code of our framework is open-sourced.
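The Python-level abstraction described in the abstract can be illustrated with a minimal sketch using Ray's standard task API (`ray.init`, `ray.remote`, `ray.get`). Here `simulate_corner` is a hypothetical stand-in for a per-task Electronic Design Automation kernel, not a function from the paper; the point is that the user writes ordinary Ray code, while the framework's autoscaler decides whether tasks run on SLURM-allocated HPC nodes or burst to Kubernetes-managed Cloud pods.

```python
# Sketch of a Python-native parallel workload that the extended Ray
# framework could scale transparently across HPC and Cloud resources.
# `simulate_corner` is a hypothetical placeholder for a real EDA kernel.
try:
    import ray
    HAVE_RAY = True
except ImportError:  # fall back to serial execution for illustration
    HAVE_RAY = False


def simulate_corner(corner_id: int) -> int:
    # Placeholder compute kernel: in a real sweep this would run one
    # circuit-simulation corner and return a figure of merit.
    return corner_id * corner_id


if HAVE_RAY:
    # In cluster mode, ray.init(address="auto") would attach to a head
    # node started on the HPC side; ray.init() starts a local instance.
    ray.init()
    remote_corner = ray.remote(simulate_corner)
    # Submitting more tasks than the currently available cores is what
    # would trigger the autoscaler to request additional SLURM nodes or,
    # when those run out, Cloud pods.
    futures = [remote_corner.remote(i) for i in range(8)]
    results = ray.get(futures)
    ray.shutdown()
else:
    results = [simulate_corner(i) for i in range(8)]
```

The property the paper emphasizes is that this script contains no job-submission logic: scaling onto SLURM allocations or bursting to Kubernetes is handled by the framework, not by changes to the user's Python code.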



Acknowledgement

This work is supported by the IBM-Illinois Discovery Accelerator Institute. This work utilizes resources supported by the National Science Foundation’s Major Research Instrumentation program, grant #1725729, as well as the University of Illinois Urbana-Champaign.

Author information

Correspondence to Volodymyr Kindratenko.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, T. et al. (2023). Cloud-Bursting and Autoscaling for Python-Native Scientific Workflows Using Ray. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_16


  • DOI: https://doi.org/10.1007/978-3-031-40843-4_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science, Computer Science (R0)
