Abstract
When submitting jobs, supercomputer users often overestimate the walltime they request. As a result, jobs complete ahead of schedule, which reduces the efficiency of job scheduling. Machine learning, using various characteristics of user jobs, can forecast a job's walltime before the job starts. When the supercomputer job management system uses these forecasts, jobs can be scheduled and executed more efficiently. In this paper, we study the efficiency of using forecasted job execution times in a geographically distributed network of supercomputer centers with decentralized management, where the execution time of a job may vary across the computing resources of different centers. We evaluate the threshold of forecast accuracy at which using forecasts to schedule jobs in a supercomputer network becomes efficient, and we estimate scheduling efficiency with job walltime forecasts taken into account.
Acknowledgments
The study was carried out within state assignment project 0580-2021-0016 and was partially supported by RFBR project No. 18-29-03236. The MVS-10P supercomputer at JSCC RAS was used in the research.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Shabanov, B., Baranov, A., Telegin, P., Tikhomirov, A. (2021). Influence of Execution Time Forecast Accuracy on the Efficiency of Scheduling Jobs in a Distributed Network of Supercomputers. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2021. Lecture Notes in Computer Science, vol 12942. Springer, Cham. https://doi.org/10.1007/978-3-030-86359-3_25
DOI: https://doi.org/10.1007/978-3-030-86359-3_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86358-6
Online ISBN: 978-3-030-86359-3
eBook Packages: Computer Science, Computer Science (R0)