
Predicting job finish time based on parameter features and running logs in supercomputing system

The Journal of Supercomputing

Abstract

Accurate estimation of job finish time is a key part of scheduling strategy design in supercomputing systems. Existing research concentrates on designing better or more complex machine learning models that predict job runtime from non-job-specific parameters, such as the number of processors consumed, the user-estimated runtime, the job submit time, and the job ID. However, system logs contain more useful information that can assist runtime prediction. In supercomputing systems, these logs routinely contain intermediate output results and input parameters, which motivates us to analyze the running status of a job and predict its finish time. Since VASP is one of the most popular supercomputing applications in the world, in this paper we conduct the first investigation into its running features and analyze its job-specific parameters in depth. Based on the running and job-specific features, we propose RunningNet, a model that dynamically predicts finish time while a job is running and that combines running features, represented as a time series, with parameter features. Experiments on the VASP job set from the supercomputing system at USTC show that RunningNet achieves state-of-the-art results, with a Mean Absolute Percentage Error (MAPE) of about 10.3%.
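The abstract reports a percentage error metric of about 10.3%. Assuming it follows the standard definition of mean absolute percentage error (MAPE), the metric can be sketched as below. The function name `mape` and the sample finish times are hypothetical and chosen only for illustration; they are not taken from the paper or its data set.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return the mean absolute percentage error, expressed in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    # Average the relative error |actual - predicted| / |actual| over all jobs.
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical job finish times in seconds (illustrative values only).
actual_finish = [1000.0, 2000.0, 4000.0]
predicted_finish = [900.0, 2200.0, 4000.0]
score = mape(actual_finish, predicted_finish)
print(f"MAPE: {score:.2f}%")
```

Because each error is divided by the true finish time, a 100-second miss on a short job weighs more than the same miss on a long one, which is why the metric is convenient for job sets with widely varying runtimes.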


Data availability

All the data used in this study are available at: https://www.dropbox.com/sh/9bfbdtiij0jafth/AABWz78PmGq3nAgYNBuC5O2-a?dl=0.



Acknowledgements

We thank the Supercomputing Center (SCC) of USTC for its support, and the SCC administrators for providing data and professional expertise. We gratefully acknowledge the computing resources provided by the Network and Information Center to run our experiments.

Funding

This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA19020102.

Author information

Contributions

Q.W. was responsible for conceptualization and methodology; Q.W. and H.Z. for software; Y.S. and X.L. for data curation; and Q.W., H.Z. and J.L. for writing.

Corresponding author

Correspondence to Jing Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Code availability

All the code used in this study is available at: https://www.dropbox.com/sh/9bfbdtiij0jafth/AABWz78PmGq3nAgYNBuC5O2-a?dl=0.

Cite this article

Wang, Q., Zhang, H., Li, J. et al. Predicting job finish time based on parameter features and running logs in supercomputing system. J Supercomput 78, 18551–18577 (2022). https://doi.org/10.1007/s11227-022-04582-5


  • DOI: https://doi.org/10.1007/s11227-022-04582-5
