Abstract
Accurate estimation of job finish time is a key part of scheduling strategy design in supercomputing systems. Existing work concentrates on designing better or more complex machine learning models that predict job runtime from non-job-specific parameters, such as the number of processors used, the user-estimated runtime, the job submit time, and the job ID. However, system logs offer more useful information to assist runtime prediction. In supercomputing systems, these logs contain a job's input parameters and intermediate output, which motivates us to analyze the running status of a job and predict its finish time. Because VASP is one of the most popular supercomputing applications in the world, this paper conducts the first investigation into its running features and analyzes its job-specific parameters in depth. Based on the running and job-specific features, we propose RunningNet, a model that dynamically predicts the finish time while a job is running by combining running features, represented as a time series, with parameter features. Experiments on the VASP job set from the supercomputing system at USTC show that RunningNet achieves state-of-the-art results, with a mean absolute percentage error (MAPE) of about 10.3%.
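As a point of reference for the percentage error reported in the abstract, the following is a minimal sketch of how such a metric could be computed. It assumes the metric follows the standard mean absolute percentage error definition, MAPE = (100 / n) * sum(|actual_i - predicted_i| / |actual_i|); the exact formula used in the paper is not given in this excerpt. The function name `mape` and the sample finish times are hypothetical and are not taken from the USTC job set.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return the mean absolute percentage error, in percent.

    Assumes the standard definition; the paper's exact formula may differ.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one job is required")
    if any(a == 0 for a in actual):
        raise ValueError("actual finish times must be nonzero")
    # Average the per-job relative errors, then convert to percent.
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical finish times in seconds for three jobs (illustrative only).
actual_seconds = [1000.0, 2000.0, 4000.0]
predicted_seconds = [900.0, 2200.0, 4000.0]
print(f"MAPE = {mape(actual_seconds, predicted_seconds):.2f}%")
```

Because each error is divided by the actual finish time, short jobs weigh as much as long ones in the average, which is worth keeping in mind when comparing results across job sets with different runtime distributions.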
Data availability
All the data used in this study are available at: https://www.dropbox.com/sh/9bfbdtiij0jafth/AABWz78PmGq3nAgYNBuC5O2-a?dl=0.
Acknowledgements
We thank the Supercomputing Center (SCC) of USTC for its support, and the SCC administrators for providing data and professional expertise. We gratefully acknowledge the computing resources provided by the Network and Information Center for running our experiments.
Funding
This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA19020102.
Contributions
Q.W. was responsible for conceptualization and methodology; Q.W. and H.Z. developed the software; Y.S. and X.L. took part in data curation; Q.W., H.Z. and J.L. took part in writing.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Code availability
All the code used in this study is available at: https://www.dropbox.com/sh/9bfbdtiij0jafth/AABWz78PmGq3nAgYNBuC5O2-a?dl=0.
Cite this article
Wang, Q., Zhang, H., Li, J. et al. Predicting job finish time based on parameter features and running logs in supercomputing system. J Supercomput 78, 18551–18577 (2022). https://doi.org/10.1007/s11227-022-04582-5
DOI: https://doi.org/10.1007/s11227-022-04582-5