Knowledge-based job scheduling for HPC

Diel, Guilherme; Nascimento Kraus, Ana Eloina; Piêgas Koslovski, Guilherme

doi:10.1007/s10586-024-04734-7

Published: 21 January 2025

Cluster Computing, volume 28, article number 153 (2025)

Guilherme Diel¹,
Ana Eloina Nascimento Kraus¹ &
Guilherme Piêgas Koslovski¹,²

Abstract

Scheduling jobs on large-scale High-Performance Computing (HPC) infrastructures is a fundamental administrative challenge, and it is essential for maximizing the productivity indicators of such complex Data Centers (DCs). The specialized literature has proposed several policies, either targeting specific scenarios and applications or offering DC-tailored heuristics. However, the appropriate policy depends directly on the applications and on the infrastructure load, both of which vary over time. Given this context, this work applies Machine Learning (ML) techniques to improve existing scheduling policies. Specifically, we propose the Knowledge-based Job Scheduler (KJS), which uses regression techniques to characterize the performance indicators of existing schedulers and stores the results in a knowledge database. Instead of proposing a one-size-fits-all scheduling function, KJS consolidates this information and uses data classification to define the jobs’ execution order. The simulation campaign demonstrates that KJS can adapt to different workloads, in contrast to the individual use of existing policies. In essence, KJS combines polynomial regression and classification to improve the performance indicators of HPC DCs from both the users’ and the administrators’ perspectives.
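The idea summarized in the abstract (regression to characterize each policy, a knowledge database, and classification to choose the execution order) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The policy names (`policy_A`, `policy_B`), the single "load" feature, the "slowdown" indicator, the synthetic observations, the polynomial degree, and the 1-nearest-neighbour classifier are all assumptions made only for illustration; the abstract does not specify any of them.

```python
"""Toy sketch: per-policy polynomial regression, a knowledge base, and a
classifier that selects the policy used to order the queue.
All names and numbers are hypothetical and synthetic."""


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations A c = b: A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic observations (assumption): mean slowdown of each hypothetical
# policy at several infrastructure loads. Lower slowdown is better.
loads = [i / 10 for i in range(1, 10)]
observed = {
    "policy_A": [1 + 10 * x * x for x in loads],
    "policy_B": [2 + 4 * x * x for x in loads],
}

# Step 1: regression characterizes each policy's indicator as a function of load.
models = {name: polyfit(loads, ys, degree=2) for name, ys in observed.items()}

# Step 2: knowledge base of (load, best policy) records built from the models.
knowledge_base = []
for k in range(1, 20):
    load = k / 20
    best = min(models, key=lambda name: polyval(models[name], load))
    knowledge_base.append((load, best))


def classify(load):
    """Step 3: 1-nearest-neighbour classification over the knowledge base."""
    return min(knowledge_base, key=lambda rec: abs(rec[0] - load))[1]


# Hypothetical sort keys: each policy orders the waiting queue differently.
SORT_KEYS = {
    "policy_A": lambda job: job["submit"],    # earliest submission first
    "policy_B": lambda job: job["walltime"],  # shortest request first
}


def order_queue(jobs, load):
    """Return job ids in the order given by the policy selected for this load."""
    policy = classify(load)
    return [job["id"] for job in sorted(jobs, key=SORT_KEYS[policy])]


jobs = [
    {"id": "j1", "submit": 0, "walltime": 300},
    {"id": "j2", "submit": 1, "walltime": 60},
    {"id": "j3", "submit": 2, "walltime": 120},
]
print(order_queue(jobs, load=0.2))
print(order_queue(jobs, load=0.8))
```

In this toy setting the selected policy changes with the load, so the same queue is ordered differently at low and high load. A realistic version would presumably use richer workload features and indicators gathered from simulation rather than a single synthetic curve per policy.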

Notes

PyBatSim available at: https://gitlab.inria.fr/batsim/pybatsim.

Acknowledgements

This work was funded by the National Council for Scientific and Technological Development (CNPq), the Santa Catarina State Research and Innovation Support Foundation (FAPESC), and Santa Catarina State University (UDESC), and was developed at the Laboratory of Parallel and Distributed Processing (LabP2D).

Author information

Authors and Affiliations

Laboratory of Parallel and Distributed Processing, Department of Computer Science, Santa Catarina State University, Joinville, SC, Brazil
Guilherme Diel, Ana Eloina Nascimento Kraus & Guilherme Piêgas Koslovski
Graduate Program in Applied Computing, Santa Catarina State University, Joinville, SC, Brazil
Guilherme Piêgas Koslovski

Corresponding author

Correspondence to Guilherme Piêgas Koslovski.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Diel, G., Nascimento Kraus, A.E. & Piêgas Koslovski, G. Knowledge-based job scheduling for HPC. Cluster Comput 28, 153 (2025). https://doi.org/10.1007/s10586-024-04734-7

Received: 04 March 2024
Revised: 03 June 2024
Accepted: 18 August 2024
Published: 21 January 2025
DOI: https://doi.org/10.1007/s10586-024-04734-7

