OSTI.GOV · U.S. Department of Energy, Office of Scientific and Technical Information

Title: DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing

Journal Article · IEEE Transactions on Parallel and Distributed Systems

Cluster schedulers are crucial in high-performance computing (HPC). They determine when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. An efficient training strategy is presented to enable DRAS to rapidly learn the target environment. Given a scheduling objective specified by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as the workload changes. We implement DRAS in an HPC scheduling platform called CQGym, which provides a common platform that allows users to flexibly evaluate DRAS and other scheduling methods, such as heuristic and optimization approaches. Experiments using CQGym with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 50%.
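To make the agent-environment interaction described in the abstract concrete, the sketch below shows a minimal deep-reinforcement-learning scheduling loop: a small policy network observes the waiting queue and free nodes, chooses which queued job to start, is rewarded for keeping the wait queue short, and is updated with REINFORCE. This is an illustrative toy, not the DRAS implementation or the CQGym API; the ToyCluster environment, the feature encoding, the flat policy network, the reward, and all sizes are hypothetical stand-ins for the paper's hierarchical network, reservation/backfilling logic, and training strategy.

"""
Illustrative sketch (not the DRAS implementation): a tiny deep-RL job
scheduler trained with REINFORCE on a toy single-resource cluster.
The environment, features, reward, and network below are hypothetical
simplifications of the setup described in the paper.
"""
import random
import torch
import torch.nn as nn

WINDOW = 4          # how many queued jobs the agent can see/choose from
TOTAL_NODES = 32    # toy cluster size
FEATS = 3           # per-job features: nodes requested, runtime, wait time


class ToyCluster:
    """Minimal scheduling environment: one scheduling decision per time tick."""

    def __init__(self, num_jobs=64):
        # each queued job: [nodes_requested, runtime, wait_time]
        self.queue = [[random.randint(1, 16), random.randint(1, 10), 0]
                      for _ in range(num_jobs)]
        self.free_nodes = TOTAL_NODES
        self.running = []          # each running job: [remaining_runtime, nodes]
        self.total_wait = 0

    def state(self):
        """Flatten the first WINDOW queued jobs plus the free-node fraction."""
        feats = []
        for job in self.queue[:WINDOW]:
            feats += [job[0] / TOTAL_NODES, job[1] / 10.0, job[2] / 50.0]
        feats += [0.0] * (FEATS * WINDOW - len(feats))   # pad short queues
        feats.append(self.free_nodes / TOTAL_NODES)
        return torch.tensor(feats, dtype=torch.float32)

    def step(self, action):
        """Try to start the job at window position `action`; advance time one tick."""
        window = self.queue[:WINDOW]
        if action < len(window) and window[action][0] <= self.free_nodes:
            job = self.queue.pop(action)
            self.free_nodes -= job[0]
            self.running.append([job[1], job[0]])
        # advance time: running jobs progress, waiting jobs accumulate wait time
        for job in self.running:
            job[0] -= 1
        for finished in [j for j in self.running if j[0] <= 0]:
            self.free_nodes += finished[1]
            self.running.remove(finished)
        for job in self.queue:
            job[2] += 1
        self.total_wait += len(self.queue)
        reward = -len(self.queue) / 10.0     # fewer waiting jobs is better
        done = not self.queue and not self.running
        return reward, done


policy = nn.Sequential(                      # toy stand-in for DRAS's hierarchical network
    nn.Linear(FEATS * WINDOW + 1, 64), nn.ReLU(),
    nn.Linear(64, WINDOW))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(200):
    env, log_probs, rewards, done = ToyCluster(), [], [], False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(env.state()))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        reward, done = env.step(action.item())
        rewards.append(reward)
    # REINFORCE: weight each log-prob by the discounted return that follows it
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if episode % 50 == 0:
        print(f"episode {episode:3d}  total wait {env.total_wait}")

In DRAS itself the state, action space, and network are considerably richer (a hierarchical network that also handles resource reservation and backfilling), but the observe-decide-reward-update loop shown here is the pattern such a scheduler is trained on.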

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-06CH11357; AC02-05CH11231; CNS-1717763; CCF-2109316
OSTI ID:
1984484
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 12; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English

References (22)

Trade-Off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates conference September 2017
Deep learning journal May 2015
Self-Optimizing Memory Controllers journal June 2008
System-wide trade-off modeling of performance, power, and resilience on petascale systems journal April 2018
Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems conference January 2013
  • Yang, Xu; Zhou, Zhou; Wallace, Sean
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), https://doi.org/10.1145/2503210.2503264
Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling journal June 2001
DeepJS conference January 2019
The Effect of System Utilization on Application Performance Variability conference June 2019
DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling journal May 2021
Mastering the game of Go without human knowledge journal October 2017
Resource Management with Deep Reinforcement Learning conference November 2016
RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning conference November 2020
Energy-efficient and thermal-aware resource management for heterogeneous datacenters journal December 2014
Scheduling Beyond CPUs for HPC conference June 2019
  • Fan, Yuping; Lan, Zhiling; Rich, Paul
  • Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '19), https://doi.org/10.1145/3307681.3325401
Learning scheduling algorithms for data processing clusters conference August 2019
Multi-resource packing for cluster schedulers conference August 2014
Deep Reinforcement Learning framework for Autonomous Driving journal January 2017
Deep Reinforcement Agent for Scheduling in HPC conference May 2021
Residual Reinforcement Learning for Robot Control conference May 2019
Minimizing Electricity Cost: Optimization of Distributed Internet Data Centers in a Multi-Electricity-Market Environment conference March 2010
A Data Driven Scheduling Approach for Power Management on HPC Systems conference November 2016
  • Wallace, Sean; Yang, Xu; Vishwanath, Venkatram
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, https://doi.org/10.1109/SC.2016.55
Function Optimization using Connectionist Reinforcement Learning Algorithms journal January 1991