Improving Accuracy of Walltime Estimates in PBS Professional Using Soft Walltimes

Chlumský, Václav; Klusáček, Dalibor

doi:10.1007/978-3-031-22698-4_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13592)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

Current batch schedulers use job walltime estimates to optimize performance and predictability when scheduling parallel jobs on computing resources. Since user-provided estimates are inaccurate and often overestimated, system administrators frequently seek ways to improve them artificially using some form of walltime predictor. In this work, we present our real-life experience with deploying such a predictor using the soft walltime feature available in the PBS Professional resource manager. Our results indicate that the applied solution works properly and significantly increases the accuracy of user-provided estimates. We share our experience with tuning the scheduler and discuss several problems that occurred along the way. We also compare how the system behavior evolved once soft walltimes were deployed in production. Last but not least, we publish the collected workload traces along with this paper to allow other researchers to further study and extend our work.

Notes

1. In PBS Professional, not every waiting job gets a reservation. Only a predefined number of high-priority jobs (per queue) have guaranteed (latest) start times; these are called top jobs. The remaining jobs can be backfilled around top jobs, provided they do not interfere with the top jobs' reservations.
2. In this context, a job batch is the set of jobs submitted into the system by a given user in a short time frame, e.g., within a few minutes.
3. The difference is caused by the fact that it takes some time before we collect enough data for each user to produce soft walltimes.
4. In our case, those are 2, 4, and 24 h; 2, 4, and 7 days; and 2, 4, or more than 4 weeks.
5. Other values such as 10 s [3, 4] or 1 min [18] are used in the literature as well. In CERIT-SC, 10 min is the recommended minimal runtime of a regular job. Shorter jobs are not recommended due to the excessive overhead related to their (frequent) processing.
6. This is also coupled with fair-share based job ordering, which we use to prioritize less active users over those who utilize the system heavily.
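
The job batch definition in note 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `group_into_batches`, the tuple layout of a job, and the 5-minute gap threshold are all assumptions chosen for illustration, and the sketch assumes a batch ends when the gap between two consecutive submissions of the same user exceeds the threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical job record: (job_id, user, submit_time_in_seconds).
Job = Tuple[str, str, float]


def group_into_batches(jobs: Iterable[Job],
                       max_gap_s: float = 300.0) -> Dict[str, List[List[str]]]:
    """Group each user's jobs into batches of closely spaced submissions.

    Assumption: a new batch starts whenever the gap between two consecutive
    submissions of the same user exceeds max_gap_s (5 minutes by default).
    """
    per_user: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for job_id, user, submit in jobs:
        per_user[user].append((submit, job_id))

    batches: Dict[str, List[List[str]]] = {}
    for user, items in per_user.items():
        items.sort()  # order this user's jobs by submission time
        user_batches: List[List[str]] = []
        last_submit = None
        for submit, job_id in items:
            # Open a new batch for the first job or after a long pause.
            if last_submit is None or submit - last_submit > max_gap_s:
                user_batches.append([])
            user_batches[-1].append(job_id)
            last_submit = submit
        batches[user] = user_batches
    return batches


if __name__ == "__main__":
    example = [
        ("a", "alice", 0.0),
        ("b", "alice", 60.0),
        ("c", "alice", 1000.0),
        ("d", "bob", 30.0),
    ]
    print(group_into_batches(example))
```

A different reading of "short time frame", such as a fixed window measured from the first job of the batch, would need only a change to the condition that opens a new batch.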

References

CERIT Scientific Cloud, August 2022. http://www.cerit-sc.cz
Chiang, S.-H., Arpaci-Dusseau, A., Vernon, M.K.: The impact of more accurate requested runtimes on production job scheduling performance. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2002. LNCS, vol. 2537, pp. 103–127. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36180-4_7
Feitelson, D.G.: Experimental analysis of the root causes of performance evaluation results: a backfilling case study. IEEE Trans. Parallel Distrib. Syst. 16(2), 175–182 (2005)
Feitelson, D.G., Rudolph, L., Schwiegelshohn, U., Sevcik, K.C., Wong, P.: Theory and practice in parallel job scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 1–34. Springer, Heidelberg (1997). https://doi.org/10.1007/3-540-63574-2_14
Soft walltime predictor implementation, August 2022. https://github.com/CESNET/softwalltime-predictor/
JSSPP workloads archive, August 2022. https://jsspp.org/workload/
Klusáček, D., Chlumský, V.: Evaluating the impact of soft walltimes on job scheduling performance. In: Desai, N., Klusáček, D., Cirne, W. (eds.) JSSPP 2018. LNCS, vol. 11332, pp. 15–38. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-10632-4_2
Lee, C.B., Schwartzman, Y., Hardy, J., Snavely, A.: Are user runtime estimates inherently inaccurate? In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2004. LNCS, vol. 3277, pp. 253–263. Springer, Heidelberg (2004). https://doi.org/10.1007/11407522_14
MetaCentrum, February 2022. http://www.metacentrum.cz/
Mu’alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 12(6), 529–543 (2001)
Seneviratne, S., Witharana, S.: A survey on methodologies for runtime prediction on grid environments. In: 7th International Conference on Information and Automation for Sustainability, pp. 1–6. IEEE (2014)
Smith, W., Taylor, V., Foster, I.: Using run-time predictions to estimate queue wait times and improve scheduler performance. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1999. LNCS, vol. 1659, pp. 202–219. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-47954-6_11
Non-destructive walltime, February 2022. https://community.openpbs.org/t/pp-482-non-destructive-walltime/587/4
Soft walltime documentation, February 2022. https://pbspro.atlassian.net/wiki/spaces/PD/pages/42532871/PP-482+Soft+Walltime
Soysal, M., Berghoff, M., Streit, A.: Analysis of job metadata for enhanced wall time prediction. In: Desai, N., Klusáček, D., Cirne, W. (eds.) JSSPP 2018. LNCS, vol. 11332, pp. 1–14. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-10632-4_1
Tsafrir, D.: Using inaccurate estimates accurately. In: Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2010. LNCS, vol. 6253, pp. 208–221. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16505-4_12
Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18(6), 789–803 (2007)
Vasupongayya, S., Chiang, S.-H.: On job fairness in non-preemptive parallel job scheduling. In: Zheng, S.Q. (ed.) International Conference on Parallel and Distributed Computing Systems (PDCS 2005), pp. 100–105. IASTED/ACTA Press (2005)

Acknowledgments

We kindly acknowledge the support and computational resources supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Author information

Authors and Affiliations

CESNET a.l.e., Brno, Czech Republic
Václav Chlumský & Dalibor Klusáček

Authors

Václav Chlumský
Dalibor Klusáček

Corresponding author

Correspondence to Dalibor Klusáček.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Polytechnic University of Catalonia, Barcelona, Spain
Julita Corbalán
Apple, Cupertino, CA, USA
Gonzalo P. Rodrigo

About this paper

Cite this paper

Chlumský, V., Klusáček, D. (2023). Improving Accuracy of Walltime Estimates in PBS Professional Using Soft Walltimes. In: Klusáček, D., Julita, C., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2022. Lecture Notes in Computer Science, vol 13592. Springer, Cham. https://doi.org/10.1007/978-3-031-22698-4_10

DOI: https://doi.org/10.1007/978-3-031-22698-4_10
Published: 12 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22697-7
Online ISBN: 978-3-031-22698-4
eBook Packages: Computer Science, Computer Science (R0)
