Abstract
Reinforcement Learning (RL) algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies, but they often take tens of thousands of timesteps to explore the environment and adapt from a randomly initialized policy. Current RL schedulers also overlook the importance of learning from pre-recorded datasets and of improving upon existing customized heuristic policies. Data-driven RL (a.k.a. batch RL) offers the prospect of policy optimization from pre-recorded datasets without online environment interaction. We explore two data-driven RL methods, behaviour cloning and offline RL, which learn policies from pre-recorded data without interacting with the environment. These methods address the cost and safety challenges of data collection that are particularly pertinent to real-world applications of RL. Although data-driven RL methods produce good results, we show that their performance depends heavily on the quality of the pre-recorded datasets used during training. We demonstrate that by effectively incorporating prior expert demonstrations to pre-train the RL scheduling agent, we can short-circuit the random exploration phase and learn a reasonable policy with minimal online training. We use batch RL as a launchpad to learn effective scheduling policies from datasets collected with an oracle or custom heuristic policies. By combining offline pre-training with minimal online RL training, we achieve comparable performance and reduce training time by roughly 3x compared to the state-of-the-art online RL method. This framework is highly effective for pre-training on prior datasets with batch RL methods and is well suited to continuous improvement through online learning.
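The abstract describes a two-phase recipe: pre-train the scheduling policy offline on recorded (state, action) demonstrations, then fine-tune it briefly online. The sketch below illustrates that recipe in a generic form and is not the paper's implementation: the network PolicyNet, the functions pretrain_behaviour_cloning and finetune_online, the demonstration dataset, and the gymnasium-style scheduling environment are all illustrative assumptions, and the REINFORCE-style update stands in for whichever online RL algorithm is actually used for fine-tuning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small MLP mapping a flattened cluster state to job-slot logits (illustrative)."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)


def pretrain_behaviour_cloning(policy, dataset, epochs=10, lr=1e-3):
    """Offline phase: supervised learning on (state, action) batches recorded
    from an oracle or heuristic scheduler. `dataset` is assumed to yield
    tensors of shape [B, state_dim] and [B]."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, actions in dataset:
            loss = F.cross_entropy(policy(states), actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy


def finetune_online(policy, env, steps=5_000, lr=1e-4, gamma=0.99):
    """Online phase: a short REINFORCE-style fine-tune that starts from the
    pre-trained policy instead of a random initialization. Assumes a
    gymnasium-style env (reset -> (obs, info), step -> 5-tuple)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    state, _ = env.reset()
    log_probs, rewards = [], []
    for _ in range(steps):
        obs = torch.as_tensor(state, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()
        state, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        if terminated or truncated:
            # Compute discounted returns for the finished episode.
            returns, g = [], 0.0
            for r in reversed(rewards):
                g = r + gamma * g
                returns.append(g)
            returns = torch.tensor(list(reversed(returns)))
            loss = -(torch.stack(log_probs) * returns).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            log_probs, rewards = [], []
            state, _ = env.reset()
    return policy
```

Under these assumptions, the offline phase supplies the "launchpad" initialization, so the online phase only has to refine an already reasonable policy rather than explore from scratch.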
Cite this paper
Venkataswamy, V., Grigsby, J., Grimshaw, A., Qi, Y. (2025). Launchpad: Learning to Schedule Using Offline and Online RL Methods. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2024. Lecture Notes in Computer Science, vol 14591. Springer, Cham. https://doi.org/10.1007/978-3-031-74430-3_4