
Job runtime prediction of HPC cluster based on PC-Transformer

The Journal of Supercomputing

Abstract

Job scheduling in high-performance clusters is a crucial task that affects the efficiency and performance of the whole system, and the accuracy of job runtime prediction is one of the key factors that determines scheduling quality. In this paper, we propose a novel method for job runtime prediction based on a Transformer with plain connections (PC-Transformer) and an attention mechanism. The proposed method uses job category information obtained by clustering historical log datasets and selects six features that are highly correlated with job runtime. We divide the datasets into multiple job sets according to the length of job runtime, and train and predict on each job set separately. We evaluate the proposed method on the HPC2N dataset and compare it with several existing methods. The results show that the proposed method achieves an average accuracy of 0.892 with a MAPE of 15.2%, and outperforms the other methods in both prediction performance and training time. The proposed method can be applied to improve the efficiency and quality of job scheduling in high-performance clusters.
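As a rough illustration of the evaluation setup described in the abstract, the sketch below shows how a job log might be split into runtime-length buckets and how MAPE could be computed per bucket. The bucket boundaries, the column names (runtime, requested_time), and the use of the user-requested time as a naive baseline prediction are assumptions for illustration only; this is not the paper's PC-Transformer implementation or its exact partitioning scheme.

```python
# Minimal sketch (not the paper's code): partition jobs by runtime length and
# compute MAPE per partition. Bucket boundaries, column names, and the naive
# "requested time" baseline are assumptions for illustration only.
import numpy as np
import pandas as pd

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def split_by_runtime(df, boundaries=(600, 3600, 86400)):
    """Split a job log into runtime-length buckets (boundaries in seconds)."""
    edges = [0, *boundaries, np.inf]
    return [df[(df["runtime"] >= lo) & (df["runtime"] < hi)]
            for lo, hi in zip(edges[:-1], edges[1:])]

# Toy log; a real workload trace would have many more jobs and features.
log = pd.DataFrame({
    "runtime": [120, 900, 5000, 100000],
    "requested_time": [300, 1800, 7200, 172800],
})
for i, bucket in enumerate(split_by_runtime(log)):
    if len(bucket) == 0:
        continue
    # Placeholder prediction: the user-requested time stands in for a model.
    print(f"bucket {i}: MAPE = {mape(bucket['runtime'], bucket['requested_time']):.1f}%")
```

In the paper's pipeline, each runtime-length job set would be trained and predicted with its own model rather than compared against the requested time as done in this toy baseline.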


Availability of data and materials

The datasets used in this paper are all publicly available and can be obtained openly.


Acknowledgements

This work is supported by the Supercomputing Center of Lanzhou University.

Funding

No funding was received for this work.

Author information


Contributions

Fengxian Chen completed all work on this paper.

Corresponding author

Correspondence to Fengxian Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics approval

The paper does not involve any ethical issues requiring approval.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, F. Job runtime prediction of HPC cluster based on PC-Transformer. J Supercomput 79, 20208–20234 (2023). https://doi.org/10.1007/s11227-023-05470-2


  • DOI: https://doi.org/10.1007/s11227-023-05470-2
