Abstract
Scheduling jobs in High-Performance Computing (HPC) platforms typically relies on heuristics built around job sorting functions, such as First-Come-First-Served or custom, hand-engineered policies. Linear regression methods are promising for exploiting scheduling data to create simple and transparent heuristics with lower computational overhead than state-of-the-art learning methods; the drawback is reduced scheduling performance. We experimentally investigated the hypothesis that we could increase the scheduling performance of regression-obtained heuristics by increasing the complexity of the sorting functions and exploiting derived job features. We used multiple linear regression to develop a factory of scheduling heuristics based on scheduling data. This factory uses general polynomials of the jobs’ characteristics as templates for the scheduling heuristics. We defined a set of polynomials of increasing complexity and used our factory to create scheduling heuristics based on these polynomials. We evaluated the performance of the obtained heuristics in wide-ranging simulation experiments using real-world traces from 1997 to 2016. Our results show that large polynomials led to unstable scheduling heuristics due to multicollinearity effects in the regression, while small polynomials led to stable and efficient scheduling performance. From these results we conclude that (i) multicollinearity imposes a constraint when deriving new features (i.e., feature engineering) for creating scheduling heuristics with regression, and (ii) regression-obtained scheduling heuristics can be resilient to the long-term evolution of HPC platforms and workloads.
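The approach the abstract describes — fitting a polynomial sorting function to scheduling data with multiple linear regression, and diagnosing instability via multicollinearity — can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the feature names (runtime estimate p, requested cores q, waiting time w), the synthetic data, and the target scores are all assumptions for illustration. The variance inflation factor (VIF) is the standard diagnostic for the multicollinearity effect the paper reports.

```python
import numpy as np

# Hypothetical job features (names assumed for illustration; the paper fits
# general polynomials of the jobs' characteristics).
rng = np.random.default_rng(0)
n = 200
p = rng.uniform(1, 100, n)       # runtime estimate
q = rng.integers(1, 64, n)       # requested cores
w = rng.uniform(0, 50, n)        # waiting time

# Degree-2 polynomial template of the features; larger templates add
# higher-degree terms, which tend to be strongly collinear.
X = np.column_stack([p, q, w, p * q, p * w, q * w, p**2, q**2, w**2])

# Synthetic target score, standing in for simulation-derived scores.
y = 2 * p + 0.5 * q + w + rng.normal(0, 1, n)

# Multiple linear regression via least squares: the fitted coefficients
# define the sorting (score) function of the heuristic.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def vif(X, j):
    """Variance inflation factor of feature j: regress it on the other
    features; VIF = 1 / (1 - R^2). Values well above 10 signal
    problematic multicollinearity."""
    others = np.delete(X, j, axis=1)
    B = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(B, X[:, j], rcond=None)
    resid = X[:, j] - B @ beta
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

In this sketch the quadratic and interaction terms are highly correlated with the linear ones, so their VIFs are large — the same mechanism that makes the large polynomial templates yield unstable regression coefficients, and hence unstable heuristics.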
Acknowledgement
This research was supported by the EuroHPC EU Regale project (g.a. 956560), São Paulo Research Foundation (FAPESP, grants 19/26702-8 and 22/06906-0), and the MIAI Grenoble-Alpes institute (ANR project number 19-P3IA-0003).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Rosa, L., Carastan-Santos, D., Goldman, A. (2023). An Experimental Analysis of Regression-Obtained HPC Scheduling Heuristics. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_6
DOI: https://doi.org/10.1007/978-3-031-43943-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43942-1
Online ISBN: 978-3-031-43943-8
eBook Packages: Computer Science (R0)