Abstract
Job scheduling on high-performance clusters is a crucial task that affects system efficiency and performance, and the accuracy of job runtime prediction is one of the key factors that determine scheduling quality. In this paper, we propose a novel method for job runtime prediction based on a Transformer with plain connection and an attention mechanism. The method uses job category information obtained by clustering historical log datasets and selects a six-dimensional feature set that is highly correlated with job runtime. We divide the datasets into multiple job sets according to job runtime length, and train and predict on each job set separately. We evaluate the method on the HPC2N dataset and compare it with several existing methods. The results show that it achieves an average accuracy of 0.892 with a MAPE of 15.2%, and that it outperforms the other methods in both prediction performance and training time. The method can be applied to improve the efficiency and quality of job scheduling on high-performance clusters.
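The two evaluation ideas named in the abstract, splitting jobs into sets by runtime length and scoring predictions with MAPE, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the bucket boundaries in `BUCKET_EDGES`, the `runtime` field name, and the helper names are assumptions chosen only for illustration, and the paper's own split points and accuracy definition are not reproduced here.

```python
from collections import defaultdict

# Hypothetical bucket boundaries in seconds (10 min, 1 h, 1 day).
# The split points actually used in the paper are not stated in the abstract.
BUCKET_EDGES = [600, 3600, 86400]


def runtime_bucket(runtime_s, edges=BUCKET_EDGES):
    """Return the index of the first edge that runtime_s does not exceed."""
    for i, edge in enumerate(edges):
        if runtime_s <= edge:
            return i
    return len(edges)  # longer than every edge: last bucket


def split_by_runtime(jobs, edges=BUCKET_EDGES):
    """Group job records (dicts with a 'runtime' key, an assumed field name)
    into job sets keyed by bucket index."""
    groups = defaultdict(list)
    for job in jobs:
        groups[runtime_bucket(job["runtime"], edges)].append(job)
    return dict(groups)


def mape(actual, predicted):
    """Mean absolute percentage error in percent.

    Jobs with an actual runtime of zero are skipped, because the
    percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no jobs with non-zero actual runtime")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)
```

In a setup like the one described, a separate model would be trained on each job set returned by `split_by_runtime`, and `mape` would be computed per set on held-out jobs before averaging.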
Availability of data and materials
The datasets used in this paper are all public and openly available.
Acknowledgements
This work was supported by the Supercomputing Center of Lanzhou University.
Funding
No funding information was provided.
Author information
Contributions
Fengxian Chen carried out all the work reported in this paper.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethics approval
This paper raises no ethical concerns.
About this article
Cite this article
Chen, F. Job runtime prediction of HPC cluster based on PC-Transformer. J Supercomput 79, 20208–20234 (2023). https://doi.org/10.1007/s11227-023-05470-2
DOI: https://doi.org/10.1007/s11227-023-05470-2