Abstract
We have extended the Ray framework to automatically scale workloads on high-performance computing (HPC) clusters managed by SLURM and to burst to Clouds managed by Kubernetes. Compared with existing HPC-Cloud convergence solutions, our framework offers several advantages: users can provide their own Cloud resources; a Python-level abstraction frees users from interacting with job submission systems directly; and a single Python-based parallel workload can run concurrently across an HPC cluster and a Cloud. Applications in Electronic Design Automation demonstrate the functionality of this solution in scaling workloads on an on-premises HPC system and automatically bursting to a public Cloud when allocated HPC resources are exhausted. The paper focuses on describing the initial implementation, demonstrating the novel functionality of the proposed framework, and identifying practical considerations and limitations of the Cloud-bursting mode. The code of our framework is open-sourced.
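The Python-level abstraction described above follows the familiar task-parallel pattern: one workload is expressed as independent Python tasks and fanned out over whatever workers are available. The sketch below illustrates that pattern using only the standard library; in the actual framework such tasks would instead be Ray remote functions (`@ray.remote`), and the autoscaler would decide whether they run on SLURM-allocated nodes or Cloud workers. The function names here are illustrative, not the framework's API.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed: int) -> int:
    """Stand-in for one unit of an EDA workload (e.g. one circuit simulation)."""
    return seed * seed  # placeholder computation

def run_workload(seeds):
    # In the Ray-based framework, each call would be a remote task that the
    # autoscaler may place on an HPC node or, when HPC resources run out,
    # on a Cloud worker. Here a local thread pool stands in for that pool
    # of heterogeneous workers.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(simulate, seeds))

if __name__ == "__main__":
    print(run_workload(range(4)))  # -> [0, 1, 4, 9]
```

The key property the framework exploits is that the workload's code does not change when the worker pool changes: only the scheduler's view of available resources does.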
Acknowledgement
This work is supported by the IBM-Illinois Discovery Accelerator Institute. This work utilizes resources supported by the National Science Foundation’s Major Research Instrumentation program, grant #1725729, as well as the University of Illinois Urbana-Champaign.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, T. et al. (2023). Cloud-Bursting and Autoscaling for Python-Native Scientific Workflows Using Ray. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4