DOI: 10.1145/3626203.3670610
Short paper
Open access

Parallel Backfill: Improving HPC System Performance by Scheduling Jobs in Parallel

Published: 17 July 2024

Abstract

High-performance computing (HPC) clusters are widely used as a platform for scientific and engineering research as well as a broad range of data-analysis tasks. Demand for HPC resources continues to grow, necessitating more scalable systems and better management of cluster resources. Job scheduling algorithms are a key component of cluster resource allocation. A common algorithm used in many production systems is backfilling, which provides an efficient and practical approach to scheduling. Many variations of backfilling that aim to improve its performance have been created and studied, but opportunities for improvement remain. In this paper, we propose a new approach, named Parallel Backfill, that improves scheduling throughput without increasing execution time in production environments. Our concept is to allow multiple backfill "workers" to process the waiting job queue in parallel, increasing the rate at which jobs are scheduled and thus improving system turnaround time for users. We present simulated results, based on job traces from the Beocat HPC cluster at Kansas State University, that show significant improvement in average job wait time and scheduler throughput. We conclude that Parallel Backfill outperforms traditional backfill and some of its variants, and we compare our results with a selection of scheduling optimizations.
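The core idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyCluster`, `backfill_worker`, and `parallel_backfill` are hypothetical names, the cluster is reduced to a free-node counter guarded by a lock, and the sketch only shows multiple workers scanning slices of a shared wait queue concurrently; a real backfill scheduler would also honor the head job's reservation and estimated runtimes.

```python
import threading

class ToyCluster:
    """Minimal shared cluster state: a free-node count guarded by a lock.

    Hypothetical illustration only, not the paper's implementation.
    """
    def __init__(self, free_nodes):
        self.free_nodes = free_nodes
        self.lock = threading.Lock()
        self.started = []

    def try_start(self, job):
        # Atomically start the job if enough nodes are free. (A real
        # backfill test would also check that starting this job cannot
        # delay the reserved head-of-queue job.)
        with self.lock:
            if job["nodes"] <= self.free_nodes:
                self.free_nodes -= job["nodes"]
                self.started.append(job["id"])
                return True
        return False

def backfill_worker(cluster, jobs):
    # Each worker scans its slice of the wait queue and starts what fits.
    for job in jobs:
        cluster.try_start(job)

def parallel_backfill(cluster, queue, n_workers=2):
    # Stripe the wait queue across workers so they scan it concurrently,
    # which is the "multiple backfill workers" idea from the abstract.
    slices = [queue[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=backfill_worker, args=(cluster, s))
               for s in slices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cluster.started
```

For example, with 8 free nodes and four 3-node jobs, exactly two jobs start regardless of thread interleaving, because each admission check is atomic.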


Published In

PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing
July 2024, 608 pages
ISBN: 9798400704192
DOI: 10.1145/3626203
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. High-Performance Computing
  2. Performance
  3. Scheduling
  4. Slurm

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

PEARC '24

Acceptance Rates

Overall Acceptance Rate: 133 of 202 submissions, 66%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 173 (last 12 months: 173; last 6 weeks: 41)

Reflects downloads up to 06 Jan 2025
