Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning


Abstract:

In cloud computing, allocating computing resources to batch jobs so that dynamic clusters remain load-balanced and user requests are met is an important and challenging task. Most existing studies are based on deep Q-networks (DQN), which use neural networks to estimate the expected value of the cumulative return in the scheduling process. These value-based DQN algorithms ignore the complete information contained in the value distribution and adapt poorly to time-varying batch jobs and dynamic cluster resources. Therefore, to capture the stochasticity that the environment induces in the scheduling process, we use Distributional Reinforcement Learning to model the value distribution of the cumulative return. Specifically, we formalize load balancing scheduling as a multi-objective optimization problem and construct a Distributional Reinforcement Learning model. We then introduce quantile regression to learn the value distribution of the cumulative return during scheduling and propose a dynamic load balancing scheduling algorithm based on Distributional Reinforcement Learning. In addition, we develop a cluster environment for real-time processing of batch jobs to simulate the arrival of batch jobs and train the Distributional Reinforcement Learning-based scheduling agent. We conduct empirical experiments and detailed analysis using the real Alibaba Cluster traces v2018 and v2020. The results show that, compared to the baseline algorithms, the proposed algorithm performs better in terms of cluster load balancing, success rate of instance creation, and average completion time of the tasks. The experimental results on different trace datasets also indicate that the proposed algorithm exhibits excellent scalability.
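The abstract does not spell out the loss used to learn the value distribution. As a rough illustration only, the sketch below follows the commonly used quantile-regression formulation for distributional reinforcement learning (midpoint quantile fractions and a quantile Huber loss) and is not taken from the paper. The function names, the `kappa` threshold, and the greedy rule that picks the action with the highest mean over quantiles are all assumptions made for this example.

```python
from typing import List


def quantile_midpoints(n: int) -> List[float]:
    """Midpoint quantile fractions tau_i = (2i + 1) / (2n) for i = 0..n-1."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def huber(u: float, kappa: float) -> float:
    """Huber loss: quadratic for |u| <= kappa, linear beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(pred: List[float], target: List[float],
                        kappa: float = 1.0) -> float:
    """Quantile Huber loss between predicted quantile values and target samples.

    pred   : predicted quantile values of the return for one state-action pair
    target : samples of the target return distribution (assumed given here)
    The loss sums over predicted quantiles and averages over target samples.
    """
    taus = quantile_midpoints(len(pred))
    total = 0.0
    for tau, theta in zip(taus, pred):
        row = 0.0
        for t in target:
            u = t - theta  # error between a target sample and this quantile
            # Asymmetric weight: over- and under-estimates are penalised
            # differently depending on the quantile fraction tau.
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            row += weight * huber(u, kappa) / kappa
        total += row / len(target)
    return total


def greedy_action(quantiles_per_action: List[List[float]]) -> int:
    """Pick the action whose quantile values have the highest mean (assumed rule)."""
    means = [sum(q) / len(q) for q in quantiles_per_action]
    return max(range(len(means)), key=means.__getitem__)
```

In a scheduling setting, each action would presumably correspond to a placement decision for an incoming batch job, and `pred` would come from a neural network rather than a fixed list; both points are assumptions, since the abstract gives no such detail.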
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 35, Issue: 1, January 2024)
Page(s): 169 - 185
Date of Publication: 20 November 2023
