Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment

Dong, Tingting; Xue, Fei; Tang, Hengliang; Xiao, Chuangbai

doi:10.1007/s10489-022-03963-w

Published: 13 August 2022

Applied Intelligence, Volume 53, pages 9916–9932 (2023)

Tingting Dong¹,² (ORCID: orcid.org/0000-0002-3436-6356), Fei Xue¹, Hengliang Tang¹ & Chuangbai Xiao²

Abstract

Cloud computing is widely used in many fields because it can provide sufficient computing resources to serve users’ demands (workflows) quickly and effectively. However, resource failure is inevitable, so fault tolerance must be considered when optimizing workflow scheduling. Most previous algorithms are based on failure prediction and fault-tolerant strategies, which can cause time delays and waste resources. This paper combines these two methods in a deep reinforcement learning framework and proposes an adaptive fault-tolerant workflow scheduling framework called RLFTWS, which aims to minimize the makespan and the resource usage rate. In this framework, fault-tolerant workflow scheduling is formulated as a Markov decision process in which the resubmission and replication strategies are the two actions. A heuristic algorithm handles task allocation and execution according to the selected fault-tolerant strategy. A double deep Q-network (DDQN) is developed to select the fault-tolerant strategy adaptively for each task under the current environment state, so the selection relies not only on prediction but also on learning through interaction with the environment. Simulation results show that the proposed RLFTWS can efficiently balance the makespan and the resource usage rate while achieving fault tolerance.
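As a rough illustration of the decision step described in the abstract, the following minimal Python sketch shows how a double deep Q-network target could be computed over the two actions (resubmission and replication). It is not the authors' implementation: the abstract does not specify the state features, the reward, the discount factor, or the exploration scheme, so the names `ACTIONS`, `select_action`, `ddqn_target`, `gamma`, and `epsilon` are all assumptions made for this example, and plain lists of Q-values stand in for the outputs of the online and target networks.

```python
import random

# The two fault-tolerant strategies named in the abstract, used as the action set.
ACTIONS = ("resubmission", "replication")


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice of a strategy index (exploration scheme is assumed)."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))  # explore: pick a random strategy
    # exploit: pick the strategy with the highest estimated Q-value
    return max(range(len(ACTIONS)), key=lambda a: q_values[a])


def ddqn_target(reward, next_q_online, next_q_target, gamma=0.9, done=False):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it. gamma is a hypothetical discount factor."""
    if done:
        return reward  # no bootstrapping after the last task of a workflow
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]


if __name__ == "__main__":
    # Hypothetical Q-value estimates for the next state, one per action.
    online_next = [0.2, 0.8]   # online network prefers "replication"
    target_next = [0.5, 0.3]   # target network's estimates for the same state
    print(ACTIONS[select_action(online_next, epsilon=0.0)])
    print(ddqn_target(1.0, online_next, target_next, gamma=0.9))
```

With these hypothetical numbers the target is roughly 1.27, because the online estimates select replication and the target estimates score it at 0.3. A plain DQN target would instead take the maximum of the target estimates (0.5) and give 1.45, which is the overestimation that the double Q-learning decoupling is meant to reduce.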

Acknowledgements

This paper is supported by the project Research on Intelligent Inventory Optimization Decision Driven by Data (2021XJKY01), the Humanity and Social Science Research of the Ministry of Education (20YJCZH200), the Beijing Intelligent Logistics System Collaborative Innovation Center Open Topic (No. BILSCIC-2019KF-05), the Grass-roots Academic Team Building Project of Beijing Wuzi University (No. 2019XJJCTD04), and the key project of the Beijing Social Science Foundation, Strategic Research on Improving the Service Quality of Capital Logistics Based on Big Data Technology (18GLA009).

Author information

Authors and Affiliations

Beijing Wuzi University, Beijing, China
Tingting Dong, Fei Xue & Hengliang Tang
Beijing University of Technology, Beijing, China
Tingting Dong & Chuangbai Xiao

Corresponding author

Correspondence to Tingting Dong.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dong, T., Xue, F., Tang, H. et al. Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment. Appl Intell 53, 9916–9932 (2023). https://doi.org/10.1007/s10489-022-03963-w

Accepted: 04 July 2022
Published: 13 August 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s10489-022-03963-w
