Influence of Execution Time Forecast Accuracy on the Efficiency of Scheduling Jobs in a Distributed Network of Supercomputers

  • Conference paper
Parallel Computing Technologies (PaCT 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12942)

Abstract

Supercomputer users often overestimate walltime when submitting jobs. As a result, jobs complete ahead of schedule, which reduces the efficiency of job scheduling. Machine learning can use various characteristics of user jobs to forecast a job's walltime before the job starts. Using these forecasts in the supercomputer job management system makes it possible to increase the efficiency of scheduling and executing jobs. In this paper, we study the efficiency of using forecasted job execution times in a geographically distributed network of supercomputer centers with decentralized management. The execution time of a job may vary across the computing resources of different supercomputer centers. We evaluate the threshold value of forecast accuracy at which scheduling jobs in a supercomputer network becomes efficient, and we estimate scheduling efficiency taking the job walltime forecasts into account.
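To make the notion of forecast accuracy concrete, the following is a minimal sketch, not the method used in the paper. It assumes accuracy is measured as the ratio of the smaller to the larger of the actual and estimated times, a convention common in the runtime-prediction literature; the function names and the sample job times are hypothetical.

```python
def estimate_accuracy(actual: float, estimate: float) -> float:
    """Return a ratio in (0, 1]; 1.0 means the estimate matched the actual runtime.

    Assumption: accuracy = min(actual, estimate) / max(actual, estimate).
    """
    if actual <= 0 or estimate <= 0:
        raise ValueError("times must be positive")
    return min(actual, estimate) / max(actual, estimate)


def mean_accuracy(jobs):
    """Average accuracy over a list of (actual, estimate) pairs."""
    return sum(estimate_accuracy(a, e) for a, e in jobs) / len(jobs)


# Hypothetical jobs: (actual runtime, estimate), both in seconds.
user_requests = [(1800, 3600), (600, 6000), (3500, 3600)]  # user walltime requests
ml_forecasts = [(1800, 2000), (600, 900), (3500, 3400)]    # illustrative forecasts

user_requested = mean_accuracy(user_requests)
forecasted = mean_accuracy(ml_forecasts)
print(f"user-requested walltime accuracy: {user_requested:.2f}")
print(f"forecasted walltime accuracy:     {forecasted:.2f}")
```

With a measure of this kind, a scheduler could compare the accuracy of its forecasts against a threshold before deciding whether to rely on them in place of user requests.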

Acknowledgments

The study was carried out within state assignment project 0580-2021-0016 and was partially supported by RFBR project No. 18-29-03236. The MVS-10P supercomputer at JSCC RAS was used in the research.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Shabanov, B., Baranov, A., Telegin, P., Tikhomirov, A. (2021). Influence of Execution Time Forecast Accuracy on the Efficiency of Scheduling Jobs in a Distributed Network of Supercomputers. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2021. Lecture Notes in Computer Science, vol 12942. Springer, Cham. https://doi.org/10.1007/978-3-030-86359-3_25

  • DOI: https://doi.org/10.1007/978-3-030-86359-3_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86358-6

  • Online ISBN: 978-3-030-86359-3

  • eBook Packages: Computer Science, Computer Science (R0)
