An Experimental Analysis of Regression-Obtained HPC Scheduling Heuristics

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2023)

Abstract

Scheduling jobs on High-Performance Computing (HPC) platforms typically relies on heuristics built from job sorting functions, such as First-Come-First-Served or custom (hand-engineered) functions. Linear regression methods are promising for exploiting scheduling data to create simple and transparent heuristics with lower computational overhead than state-of-the-art learning methods. The drawback is lower scheduling performance. We experimentally investigated the hypothesis that the scheduling performance of regression-obtained heuristics can be improved by increasing the complexity of the sorting functions and exploiting derived job features. We used multiple linear regression to develop a factory of scheduling heuristics based on scheduling data. This factory uses general polynomials of the jobs’ characteristics as templates for the scheduling heuristics. We defined a set of polynomials of increasing complexity and used our factory to create scheduling heuristics based on them. We evaluated the performance of the obtained heuristics with wide-ranging simulation experiments using real-world traces from 1997 to 2016. Our results show that large polynomials led to unstable scheduling heuristics due to multicollinearity effects in the regression, whereas small polynomials led to stable and efficient scheduling performance. These results indicate that (i) multicollinearity imposes a constraint when one wants to derive new features (i.e., feature engineering) for creating scheduling heuristics with regression, and (ii) regression-obtained scheduling heuristics can be resilient to the long-term evolution of HPC platforms and workloads.
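The idea of fitting a polynomial sorting function by multiple linear regression, and of checking it for multicollinearity, can be illustrated with a minimal sketch. It is not the authors' factory: the feature names `p`, `q`, and `r` (standing for, say, estimated processing time, requested cores, and submission time), the synthetic data, the target `score`, and the helper names are all assumptions made for illustration. The variance inflation factor (VIF) is used here as one common multicollinearity diagnostic.

```python
import numpy as np

def design_matrix(p, q, r, degree):
    """Intercept plus every monomial p^a * q^b * r^c with 1 <= a+b+c <= degree."""
    cols, names = [np.ones_like(p)], ["1"]
    for a in range(degree + 1):
        for b in range(degree + 1 - a):
            for c in range(degree + 1 - a - b):
                if a + b + c > 0:
                    cols.append(p**a * q**b * r**c)
                    names.append(f"p^{a}*q^{b}*r^{c}")
    return np.column_stack(cols), names

def fit_heuristic(X, score):
    """Ordinary least squares: coefficients of the polynomial sorting function."""
    coef, *_ = np.linalg.lstsq(X, score, rcond=None)
    return coef

def vif(X):
    """VIF of each non-intercept column: total / residual sum of squares
    when that column is regressed on all the other columns."""
    out = []
    for j in range(1, X.shape[1]):
        others = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        ss_res = resid @ resid
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(ss_tot / max(ss_res, 1e-300))
    return np.array(out)

# Synthetic, hypothetical job data (not taken from any real trace).
rng = np.random.default_rng(0)
n = 500
p = rng.uniform(1.0, 10.0, n)   # assumed: estimated processing time
q = rng.uniform(1.0, 8.0, n)    # assumed: requested cores
r = rng.uniform(0.0, 10.0, n)   # assumed: submission time
score = 2.0 * p + 0.5 * q - 0.1 * r  # assumed target priority score

X1, names1 = design_matrix(p, q, r, degree=1)  # small polynomial
X3, names3 = design_matrix(p, q, r, degree=3)  # larger polynomial
coef1 = fit_heuristic(X1, score)
vif1, vif3 = vif(X1), vif(X3)

# Jobs would be sorted by the fitted function (lowest score first here).
order = np.argsort(X1 @ coef1)
print("max VIF, degree 1:", vif1.max())
print("max VIF, degree 3:", vif3.max())
```

In this sketch the degree-3 design matrix contains strongly correlated columns (for example `p`, `p^2`, and `p^3`), so its VIFs are much larger than those of the degree-1 matrix, which is the kind of effect the abstract attributes to large polynomials.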

Notes

  1. https://github.com/fredgrub/scheduling-simulator

Acknowledgement

This research was supported by the EuroHPC EU Regale project (g.a. 956560), São Paulo Research Foundation (FAPESP, grants 19/26702-8 and 22/06906-0), and the MIAI Grenoble-Alpes institute (ANR project number 19-P3IA-0003).

Corresponding author

Correspondence to Alfredo Goldman.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rosa, L., Carastan-Santos, D., Goldman, A. (2023). An Experimental Analysis of Regression-Obtained HPC Scheduling Heuristics. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_6

  • DOI: https://doi.org/10.1007/978-3-031-43943-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43942-1

  • Online ISBN: 978-3-031-43943-8

  • eBook Packages: Computer Science, Computer Science (R0)
