Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment

Applied Intelligence

Abstract

Cloud computing is widely used in various fields because it provides sufficient computing resources to address users' demands (workflows) quickly and effectively. However, resource failure is inevitable, so a key challenge in optimizing workflow scheduling is to account for fault tolerance. Most previous algorithms are based on failure prediction or on fault-tolerant strategies, which can cause time delays and waste resources. In this paper, these two methods are combined through a deep reinforcement learning framework: an adaptive fault-tolerant workflow scheduling framework called RLFTWS is proposed, aiming to minimize the makespan and the resource usage rate. In this framework, fault-tolerant workflow scheduling is formulated as a Markov decision process, with the resubmission strategy and the replication strategy as the two actions. A heuristic algorithm is designed for task allocation and execution according to the selected fault-tolerant strategy. A double deep Q network (DDQN) is developed to select the fault-tolerant strategy adaptively for each task under the current environment state, so that the scheduler not only predicts but also learns while interacting with the environment. Simulation results show that the proposed RLFTWS can efficiently balance the makespan and the resource usage rate while achieving fault tolerance.
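As an illustration of the action selection described in the abstract, the following is a minimal sketch of the standard double Q-learning target with the two fault-tolerant strategies as actions. It is not the authors' implementation: the function names, the discount factor `gamma`, and the plain lists standing in for the outputs of the online and target networks are assumptions made for this example, and the abstract does not specify the reward or the state encoding.

```python
from typing import Sequence

# The two actions named in the abstract; the index order is an assumption.
ACTIONS = ("resubmission", "replication")


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (first index on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def ddqn_target(
    reward: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    gamma: float = 0.9,  # hypothetical discount factor
    done: bool = False,
) -> float:
    """Double Q-learning target for one transition.

    The online network chooses the next action and the target network
    evaluates it, which reduces the overestimation of plain Q-learning.
    """
    if done:
        return reward
    best_action = argmax(q_online_next)
    return reward + gamma * q_target_next[best_action]


def select_strategy(q_values: Sequence[float]) -> str:
    """Greedy choice of the fault-tolerant strategy for the current task."""
    return ACTIONS[argmax(q_values)]
```

Here `select_strategy([0.1, 0.4])` picks the replication strategy because its estimated value is higher. During training, an exploration rule such as epsilon-greedy would normally be layered on top of this greedy choice.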

Acknowledgements

This work was supported by the project Research on Intelligent Inventory Optimization Decision Driven by Data (2021XJKY01), the Humanity and Social Science Research of the Ministry of Education (20YJCZH200), the Beijing Intelligent Logistics System Collaborative Innovation Center Open Topic (No. BILSCIC-2019KF-05), the Grass-roots Academic Team Building Project of Beijing Wuzi University (No. 2019XJJCTD04), and the key project of the Beijing Social Science Foundation, Strategic Research on Improving the Service Quality of Capital Logistics Based on Big Data Technology (18GLA009).

Author information

Corresponding author

Correspondence to Tingting Dong.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dong, T., Xue, F., Tang, H. et al. Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment. Appl Intell 53, 9916–9932 (2023). https://doi.org/10.1007/s10489-022-03963-w

  • DOI: https://doi.org/10.1007/s10489-022-03963-w
