Abstract
Cloud computing is widely used in many fields because it provides sufficient computing resources to handle users’ demands (workflows) quickly and effectively. However, resource failure is inevitable, and a key challenge in optimizing workflow scheduling is to account for fault tolerance. Most previous algorithms rely on failure prediction or on fault-tolerant strategies, which can cause time delays and wasted resources. This paper combines the two methods in a deep reinforcement learning framework and proposes an adaptive fault-tolerant workflow scheduling framework called RLFTWS, which aims to minimize the makespan and the resource usage rate. In this framework, fault-tolerant workflow scheduling is formulated as a Markov decision process, with the resubmission and replication strategies as the two actions. A heuristic algorithm is designed for task allocation and execution according to the selected fault-tolerant strategy. A double deep Q network (DDQN) is developed to select the fault-tolerant strategy adaptively for each task under the current environment state, so the selection is not only predictive but also learned through interaction with the environment. Simulation results show that the proposed RLFTWS efficiently balances the makespan and the resource usage rate while achieving fault tolerance.
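To make the DDQN step concrete, the following is a minimal sketch of the generic double Q-learning target applied to a two-action choice between resubmission and replication. It is not the paper's implementation: the reward value, the discount factor `gamma`, the Q-value lists, and the function names are all hypothetical, and the abstract does not specify the state features or the reward design.

```python
from typing import Sequence

# The two fault-tolerant actions named in the abstract; the integer
# encoding is an assumption made for this sketch.
RESUBMISSION, REPLICATION = 0, 1


def greedy_action(q_values: Sequence[float]) -> int:
    """Return the index of the largest Q-value (ties go to the lower index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ddqn_target(
    reward: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    gamma: float = 0.9,
    done: bool = False,
) -> float:
    """Generic double DQN target for one transition.

    The online network chooses the next action and the target network
    evaluates it, which reduces the overestimation of plain Q-learning.
    """
    if done:
        # No future return after the last task of the workflow.
        return reward
    best_next = greedy_action(next_q_online)
    return reward + gamma * next_q_target[best_next]


# Hypothetical Q-values for the next state, one entry per action.
online_q = [0.2, 0.8]   # the online network prefers replication
target_q = [4.0, 2.0]   # the target network evaluates that choice
y = ddqn_target(reward=1.0, next_q_online=online_q,
                next_q_target=target_q, gamma=0.5)
```

Here the online network picks replication for the next task, so the target uses the target network's estimate for replication (2.0) rather than its own maximum (4.0). That decoupling is the difference between double DQN and a standard DQN target.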














Acknowledgements
This work was supported by the project Research on Intelligent Inventory Optimization Decision Driven by Data (2021XJKY01), the Humanity and Social Science Research of the Ministry of Education (20YJCZH200), the Beijing Intelligent Logistics System Collaborative Innovation Center Open Topic (No. BILSCIC-2019KF-05), the Grass-roots Academic Team Building Project of Beijing Wuzi University (No. 2019XJJCTD04), and the key project of the Beijing Social Science Foundation, Strategic Research on Improving the Service Quality of Capital Logistics Based on Big Data Technology (18GLA009).
Cite this article
Dong, T., Xue, F., Tang, H. et al. Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment. Appl Intell 53, 9916–9932 (2023). https://doi.org/10.1007/s10489-022-03963-w
DOI: https://doi.org/10.1007/s10489-022-03963-w