Title: SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning

Conference
 [1];  [1];  [2]
  1. University of North Carolina at Charlotte
  2. ORNL

Improving job execution performance, for example by minimizing job waiting time, slowdown, or completion time, is an important goal of HPC batch job schedulers. This goal is often pursued using carefully designed heuristics based on job features, such as job size and job duration. However, these heuristics overlook important runtime factors (e.g., cluster availability and waiting job patterns), which vary over time and can invalidate a previously sound scheduling decision. In this study, we propose a new approach that incorporates runtime factors into batch job scheduling for better job execution performance. The key idea is to add a scheduling inspector on top of the base job scheduler to scrutinize its scheduling decisions. The inspector takes the runtime factors into consideration and determines the fitness of the scheduled job accordingly. It then either accepts the scheduled job or rejects it and asks the base scheduler to try again later. We realize such an inspector, named SchedInspector, using reinforcement learning. Through extensive experiments, we show that SchedInspector can intelligently integrate the runtime factors into various batch job scheduling policies, including the state-of-the-art one, to gain better job execution performance, such as smaller average bounded job slowdown (up to 69% better) or average job waiting time (up to 52% better), across various real-world workloads. We also show that, although rejecting scheduling decisions may leave resources idle and hence affect system utilization, SchedInspector achieves the job execution performance improvement with marginal impact on system utilization (typically less than 1%). A key advantage of SchedInspector is that it automatically learns to work with and improve existing job scheduling policies without changing them, which makes it promising as a generic enhancer for various batch job scheduling policies.
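The accept-or-reject idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Job` fields, the shortest-job-first base policy, and in particular the hand-written `threshold_inspector` rule are all hypothetical stand-ins, and the learned reinforcement learning policy is replaced here by a fixed rule purely to show where an inspector would sit relative to the base scheduler. The `bounded_slowdown` helper uses the commonly used definition of bounded slowdown with an assumed 10-second threshold; the abstract does not state which threshold the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    # Hypothetical job record; field names are illustrative only.
    job_id: int
    submit_time: float
    requested_nodes: int
    estimated_runtime: float


def sjf_pick(queue: List[Job]) -> Optional[Job]:
    """Illustrative base policy: shortest estimated runtime first."""
    return min(queue, key=lambda j: j.estimated_runtime) if queue else None


def threshold_inspector(job: Job, queue: List[Job], free_nodes: int) -> bool:
    """Hand-written placeholder for the learned inspector (an assumption).

    Accepts the job if it is the only waiting job, or if it needs at most
    half of the currently free nodes; otherwise rejects it.
    """
    others_waiting = len(queue) > 1
    return (not others_waiting) or job.requested_nodes <= free_nodes * 0.5


def inspect_and_schedule(
    queue: List[Job],
    free_nodes: int,
    base_policy: Callable[[List[Job]], Optional[Job]],
    inspector: Callable[[Job, List[Job], int], bool],
) -> Optional[Job]:
    """Ask the base policy for a job, then let the inspector accept or reject it.

    Returns the job to start, or None if nothing fits or the decision was
    rejected (the base scheduler is then expected to try again later).
    """
    candidate = base_policy(queue)
    if candidate is None or candidate.requested_nodes > free_nodes:
        return None
    if inspector(candidate, queue, free_nodes):
        queue.remove(candidate)
        return candidate
    return None


def bounded_slowdown(wait_time: float, run_time: float, tau: float = 10.0) -> float:
    """Bounded slowdown: max((wait + run) / max(run, tau), 1)."""
    return max((wait_time + run_time) / max(run_time, tau), 1.0)
```

In this sketch the base policy is passed in unchanged and the inspector only vetoes its choice, which mirrors the abstract's point that the existing scheduling policy does not need to be modified. A rejected decision leaves the queue untouched, so the same job may be proposed again at the next scheduling point.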

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1885384
Resource Relation:
Conference: International Symposium on High-Performance Parallel and Distributed Computing (HPDC) - Minneapolis, Minnesota, United States of America - 6/27/2022-7/1/2022
Country of Publication:
United States
Language:
English

