An Experimental Analysis of Regression-Obtained HPC Scheduling Heuristics

Rosa, Lucas; Carastan-Santos, Danilo; Goldman, Alfredo

doi:10.1007/978-3-031-43943-8_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14283)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

Scheduling jobs on High-Performance Computing (HPC) platforms typically relies on heuristics built from job sorting functions, such as First-Come-First-Served or custom (hand-engineered) functions. Linear regression methods are promising for exploiting scheduling data to create simple and transparent heuristics with lower computational overhead than state-of-the-art learning methods. The drawback is lower scheduling performance. We experimentally investigated the hypothesis that the scheduling performance of regression-obtained heuristics can be improved by increasing the complexity of the sorting functions and exploiting derived job features. We used multiple linear regression to develop a factory of scheduling heuristics based on scheduling data. This factory uses general polynomials of the jobs’ characteristics as templates for the scheduling heuristics. We defined a set of polynomials of increasing complexity and used our factory to create scheduling heuristics based on them. We evaluated the performance of the obtained heuristics with wide-ranging simulation experiments using real-world traces from 1997 to 2016. Our results show that large polynomials led to unstable scheduling heuristics due to multicollinearity effects in the regression, whereas small polynomials led to stable and efficient scheduling performance. These results indicate that (i) multicollinearity imposes a constraint when one wants to derive new features (i.e., feature engineering) for creating scheduling heuristics with regression, and (ii) regression-obtained scheduling heuristics can be resilient to the long-term evolution of HPC platforms and workloads.
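The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two hypothetical normalized job features (p for estimated processing time, q for requested cores), a synthetic noise-free score, and the convention that a lower predicted score means higher priority. It fits a polynomial template by ordinary least squares, sorts a queue by the fitted function, and uses the variance inflation factor (VIF) to show how a derived feature such as p*q introduces multicollinearity.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(coef, x):
    """Evaluate the fitted sorting function on one feature vector."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))

def r_squared(X, y):
    """Coefficient of determination of the least-squares fit of y on X."""
    coef = fit_linear(X, y)
    mean = sum(y) / len(y)
    ss_res = sum((t - predict(coef, r)) ** 2 for r, t in zip(X, y))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2), regressing column j on the other columns."""
    target = [r[j] for r in X]
    others = [[v for i, v in enumerate(r) if i != j] for r in X]
    return 1.0 / (1.0 - r_squared(others, target))

# Synthetic data (assumption): features drawn uniformly from [0, 1).
rng = random.Random(0)
jobs = [(rng.random(), rng.random()) for _ in range(200)]
small = [[p, q] for p, q in jobs]          # small polynomial: p, q
large = [[p, q, p * q] for p, q in jobs]   # adds the derived feature p*q

# Hypothetical target score; the coefficients are made up for illustration.
scores = [0.2 + 1.5 * p + 0.8 * q for p, q in jobs]
coef = fit_linear(small, scores)

# Use the fitted function as a job sorting heuristic (lower score first).
queue = small[:5]
ordered = sorted(queue, key=lambda job: predict(coef, job))

vif_small = vif(small, 0)  # p against q only
vif_large = vif(large, 2)  # p*q against p and q
```

In this sketch the VIF of the derived term is noticeably higher than that of the base features, which is the kind of effect the abstract attributes to larger polynomials. The normal equations are used only to keep the example dependency-free; a numerically robust least-squares routine would be preferable in practice.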

Notes

1. https://github.com/fredgrub/scheduling-simulator

Acknowledgement

This research was supported by the EuroHPC EU Regale project (g.a. 956560), São Paulo Research Foundation (FAPESP, grants 19/26702-8 and 22/06906-0), and the MIAI Grenoble-Alpes institute (ANR project number 19-P3IA-0003).

Author information

Authors and Affiliations

Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
Lucas Rosa & Alfredo Goldman
University Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, Grenoble, France
Danilo Carastan-Santos

Corresponding author

Correspondence to Alfredo Goldman.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Barcelona Supercomputing Center, Barcelona, Spain
Julita Corbalán
Apple, Cupertino, CA, USA
Gonzalo P. Rodrigo

About this paper

Cite this paper

Rosa, L., Carastan-Santos, D., Goldman, A. (2023). An Experimental Analysis of Regression-Obtained HPC Scheduling Heuristics. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_6

DOI: https://doi.org/10.1007/978-3-031-43943-8_6
Published: 15 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43942-1
Online ISBN: 978-3-031-43943-8
eBook Packages: Computer Science, Computer Science (R0)
