
Task replication to improve the reliability of running workflows on the cloud

Cluster Computing

Abstract

Clouds are heterogeneous service-oriented systems that are increasingly considered the platform of choice for scientific workflow applications. Because resource and communication failures are inevitable in large, complex distributed systems, ensuring the reliability of heterogeneous service-oriented systems poses a major challenge. Because it affects the quality of service that users require, reliability has become an important criterion in workflow scheduling. Replication-based fault tolerance is one approach to satisfying the reliability requirements set for an application. To minimize workflow execution cost while respecting the user-defined deadline and reliability, this paper proposes Improving CbCP with Replication (ICR), which comprises three algorithms: Scheduling, Fix Up, and Task Replication. The Scheduling algorithm employs CbCP (Clustering based on Critical Parent), an algorithm previously developed by the same authors, to generate a schedule map of the workflow. The Fix Up algorithm checks whether each task can start earlier on its leased resource without incurring any extra cost. The Task Replication algorithm uses the remaining idle time slots on leased resources to replicate tasks. Experimental results from real and randomly generated applications at different scales demonstrate that, for the majority of the studied scenarios, the proposed heuristic increases the execution reliability of workflows while reducing their execution cost.
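
The effect of replication on reliability can be illustrated with a minimal sketch. It is not the paper's model, which is not given in this abstract. It assumes that replicas fail independently, that a task succeeds if at least one of its replicas succeeds, and that the workflow succeeds only if every task succeeds. The function names, task names, and reliability values are hypothetical.

```python
from math import prod


def task_reliability(replica_reliabilities):
    """Probability that at least one replica succeeds.

    Assumption: replicas fail independently of each other.
    """
    return 1.0 - prod(1.0 - r for r in replica_reliabilities)


def workflow_reliability(tasks):
    """Probability that every task succeeds.

    `tasks` maps a task name to the list of reliabilities of its replicas.
    Assumption: the workflow fails if any single task fails.
    """
    return prod(task_reliability(rs) for rs in tasks.values())


# Hypothetical three-task workflow, one copy of each task.
baseline = {"t1": [0.95], "t2": [0.90], "t3": [0.99]}

# Same workflow, with the least reliable task replicated once,
# for example in an idle time slot of an already leased resource.
replicated = {"t1": [0.95], "t2": [0.90, 0.90], "t3": [0.99]}

print(workflow_reliability(baseline))
print(workflow_reliability(replicated))
```

Under these assumptions, a single extra copy of the weakest task raises its reliability from 0.90 to 0.99. If the copy fits into idle time that has already been paid for, the gain comes without additional leasing cost, which is the intuition behind replicating into idle slots.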



Acknowledgements

The authors would like to express their gratitude to the anonymous reviewers for their constructive comments, which have helped to improve the quality of the paper.

Author information

Corresponding author

Correspondence to Mahmoud Naghibzadeh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mousavi Nik, S.S., Naghibzadeh, M. & Sedaghat, Y. Task replication to improve the reliability of running workflows on the cloud. Cluster Comput 24, 343–359 (2021). https://doi.org/10.1007/s10586-020-03109-y

  • DOI: https://doi.org/10.1007/s10586-020-03109-y
