Abstract
Accurate and fast simulation of HPC job scheduling is an important tool for exploring the effect of different scheduling strategies on production systems and for providing insight into future HPC design. Current realistic simulations are computationally intensive and cannot provide the rapid feedback loop needed to develop novel scheduling strategies. This work presents a lightweight simulation of the workload manager Slurm that accurately reproduces the performance of the UK’s national supercomputer ARCHER2 using historical workload accounting data. The simulation achieves a speed-up of \(\sim 400\) over a period of full system utilisation, allowing months of activity to be simulated in hours, while keeping wait times accurate to within 7%. The simulator design supports incorporating external factors into scheduling, which enables comparison of power-aware strategies. Using the simulation to evaluate the effect of several possible scheduling changes to ARCHER2, with a focus on improving power management, demonstrates its potential to provide insight into configuration changes and extensions to the scheduling logic.
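To make the "months in hours" claim concrete, the arithmetic can be sketched as below. This is only an illustration: it assumes the quoted speed-up is the ratio of simulated system time to wall-clock run time, and the function name `wall_clock_hours` and the 30-day and 90-day periods are hypothetical choices, not values from the paper.

```python
def wall_clock_hours(simulated_days: float, speed_up: float = 400.0) -> float:
    """Estimate the wall-clock hours needed to simulate a period of activity.

    Assumption (not stated in the source): speed_up is the ratio of
    simulated time to wall-clock time, so run time = simulated time / speed_up.
    """
    if speed_up <= 0:
        raise ValueError("speed_up must be positive")
    simulated_hours = simulated_days * 24.0
    return simulated_hours / speed_up


if __name__ == "__main__":
    # Hypothetical periods chosen for illustration only.
    for days in (30, 90):
        hours = wall_clock_hours(days)
        print(f"{days} days of activity -> about {hours:.1f} hours of simulation")
```

Under that assumption, one month of system activity corresponds to roughly two hours of simulation and three months to between five and six hours, which is consistent with the abstract's description.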
Acknowledgements
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk). We would like to thank Dr Andrew Turner from EPCC for valuable discussions during development and validation with ARCHER2. Thanks to the Cambridge Service for Data Driven Discovery (CSD3) at the University of Cambridge for providing access to workload data. We acknowledge support of CSC who provided access to scheduler data from the LUMI system, owned by the EuroHPC Joint Undertaking.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wilkinson, A., Jones, J., Richardson, H., Dykes, T., Haus, UU. (2023). A Fast Simulator to Enable HPC Scheduling Strategy Comparisons. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_24
DOI: https://doi.org/10.1007/978-3-031-40843-4_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)