Predicting job finish time based on parameter features and running logs in supercomputing system

Wang, Qiqi; Zhang, Hongjie; Li, Jing; Shen, Yu; Liu, Xiaohui

doi:10.1007/s11227-022-04582-5

Published: 08 June 2022

Volume 78, pages 18551–18577, (2022)

The Journal of Supercomputing

Qiqi Wang¹,
Hongjie Zhang¹,
Jing Li¹,² (ORCID: orcid.org/0000-0001-6761-7687),
Yu Shen² &
Xiaohui Liu²

Abstract

Accurate job finish time estimation is a key part of scheduling strategy design in supercomputing systems. Existing work concentrates on designing better or more complex machine learning models that predict job runtime from non-job-specific parameters, such as the number of processors consumed, the user-estimated runtime, the job submit time, and the job ID. However, more useful information can be extracted from the system logs to assist runtime prediction. System logs in supercomputing always contain intermediate output results and input parameters, which motivates us to analyze the running status of a job and predict its finish time. Because VASP is one of the most popular supercomputing applications in the world, in this paper we conduct the first investigation into running features and analyze the job-specific parameters in depth. Based on the running and job-specific features, we propose RunningNet, a model that dynamically predicts the finish time while a job is running by combining running features, represented as a time series, with parameter features. Experiments on the VASP job set from the supercomputing system at USTC show that RunningNet achieves state-of-the-art results, with a mean absolute percentage error (MAPE) of about 10.3%.
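The accuracy figure in the abstract is a percentage error over predicted finish times. As a minimal sketch, assuming the metric follows the standard mean absolute percentage error definition (the paper's exact formula is not reproduced on this page), it could be computed as below. The function name `mape` and the sample finish times are hypothetical and are not taken from the paper's data.

```python
def mape(actual, predicted):
    """Return the mean absolute percentage error, in percent.

    Assumes the standard definition: the mean of |actual - predicted| / |actual|.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one job is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            # The relative error is undefined for a zero true value.
            raise ValueError("actual values must be non-zero")
        total += abs(a - p) / abs(a)
    return 100.0 * total / len(actual)


# Hypothetical finish times in seconds for three jobs (illustration only).
actual_seconds = [1000.0, 2000.0, 4000.0]
predicted_seconds = [1100.0, 1800.0, 4000.0]

score = mape(actual_seconds, predicted_seconds)
print(f"MAPE: {score:.2f}%")
```

Under this definition, a value of about 10.3% would mean that predicted finish times deviate from the true finish times by roughly a tenth of the true value on average.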

Data availability

All the data used in this study are available at: https://www.dropbox.com/sh/9bfbdtiij0jafth/AABWz78PmGq3nAgYNBuC5O2-a?dl=0.

Acknowledgements

We thank the Supercomputing Center (SCC) of USTC for its support, and the SCC administrators for providing data and professional knowledge. We gratefully acknowledge the computing resources provided by the Network and Information Center to run our experiments.

Funding

This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA19020102.

Author information

Authors and Affiliations

School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230026, Anhui, China
Qiqi Wang, Hongjie Zhang & Jing Li
Supercomputing Center, University of Science and Technology of China, Hefei, 230026, Anhui, China
Jing Li, Yu Shen & Xiaohui Liu

Contributions

Q.W. was involved in conceptualization and methodology; Q.W. and H.Z. were involved in software; Y.S. and X.L. took part in data curation; Q.W., H.Z. and J.L. took part in writing.

Corresponding author

Correspondence to Jing Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Code availability

All the code used in this study is available at: https://www.dropbox.com/sh/9bfbdtiij0jafth/AABWz78PmGq3nAgYNBuC5O2-a?dl=0.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, Q., Zhang, H., Li, J. et al. Predicting job finish time based on parameter features and running logs in supercomputing system. J Supercomput 78, 18551–18577 (2022). https://doi.org/10.1007/s11227-022-04582-5

Accepted: 30 April 2022
Published: 08 June 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11227-022-04582-5
