
Job runtime prediction of HPC cluster based on PC-Transformer

The Journal of Supercomputing

Abstract

Job scheduling in high-performance clusters is a crucial task that affects the efficiency and performance of the whole system, and the accuracy of job runtime prediction is one of the key factors that determines scheduling quality. In this paper, we propose a novel method for job runtime prediction based on a Transformer with plain connections (PC-Transformer) and an attention mechanism. The proposed method uses job category information obtained by clustering historical log datasets and selects six features that are highly correlated with job runtime. We divide the datasets into multiple job sets according to the length of job runtime, and train and predict on each job set separately. We evaluate the proposed method on the HPC2N dataset and compare it with several existing methods. The results show that the proposed method achieves an average accuracy of 0.892 with a MAPE of 15.2%, and outperforms the other methods in both prediction performance and training time. The proposed method can be applied to improve the efficiency and quality of job scheduling in high-performance clusters.
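As a rough illustration of the evaluation setup described in the abstract, the sketch below shows how a job log might be split into runtime-length buckets and how MAPE could be computed per bucket. The bucket boundaries, the column names (runtime, requested_time), and the use of the user-requested time as a naive baseline prediction are assumptions for illustration only; this is not the paper's PC-Transformer implementation or its exact partitioning scheme.

```python
# Minimal sketch (not the paper's code): partition jobs by runtime length and
# compute MAPE per partition. Bucket boundaries, column names, and the naive
# "requested time" baseline are assumptions for illustration only.
import numpy as np
import pandas as pd

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def split_by_runtime(df, boundaries=(600, 3600, 86400)):
    """Split a job log into runtime-length buckets (boundaries in seconds)."""
    edges = [0, *boundaries, np.inf]
    return [df[(df["runtime"] >= lo) & (df["runtime"] < hi)]
            for lo, hi in zip(edges[:-1], edges[1:])]

# Toy log; a real workload trace would have many more jobs and features.
log = pd.DataFrame({
    "runtime": [120, 900, 5000, 100000],
    "requested_time": [300, 1800, 7200, 172800],
})
for i, bucket in enumerate(split_by_runtime(log)):
    if len(bucket) == 0:
        continue
    # Placeholder prediction: the user-requested time stands in for a model.
    print(f"bucket {i}: MAPE = {mape(bucket['runtime'], bucket['requested_time']):.1f}%")
```

In the paper's pipeline, each runtime-length job set would be trained and predicted with its own model rather than compared against the requested time as done in this toy baseline.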


Availability of data and materials

The datasets used in this paper are all publicly available and can be obtained openly.


Acknowledgements

This work is supported by the Supercomputing Center of Lanzhou University.

Funding

No funding was received for this work.

Author information


Contributions

Fengxian Chen completed all work on this paper.

Corresponding author

Correspondence to Fengxian Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics approval

The paper does not involve any ethical issues requiring approval.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, F. Job runtime prediction of HPC cluster based on PC-Transformer. J Supercomput 79, 20208–20234 (2023). https://doi.org/10.1007/s11227-023-05470-2


  • DOI: https://doi.org/10.1007/s11227-023-05470-2
