A Fault-Tolerant Workflow Scheduling Algorithm for Grid with Near-Optimal Redundancy

Journal of Grid Computing

Abstract

In scheduling workflows in a grid environment, concerns such as minimizing the makespan and cost, meeting time and budget constraints, and the possibility of resource failures have motivated researchers to propose numerous scheduling algorithms. Several heuristic and meta-heuristic algorithms have been proposed to address these issues, each of which often considers only one or a few of these criteria. However, less attention has been paid to fault-tolerant scheduling of workflows. Adding fault tolerance to a workflow scheduling algorithm leads to an inevitable increase in the makespan and cost. Using the resubmission technique may result in an unacceptable increase in the execution time and a possible violation of the deadline, while the replication method increases the execution cost. In this paper, we propose a fault-tolerant workflow scheduling algorithm with near-optimal time and cost overhead. The proposed approach brings a two-fold novelty. First, we assume a stochastic model of the workflow with nondeterministic task parameters, use interval arithmetic to model task execution times, and propose a new scheduling algorithm in which task assignment decisions are taken according to the performability fluctuations of the computational resources. Second, we employ an efficient combination of resubmission and replication techniques to achieve the benefits of both and propose an algorithm for reliable scheduling of scientific workflows with near-optimal additional time and cost. The proposed method achieves a significant increase in reliability while the additional execution time and cost are almost negligible.
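
The two ideas named in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the `Interval` class, the function names, the independence of failures, and the worst-case retry model are all assumptions made here for illustration. It shows how interval arithmetic bounds the finish time of a small task graph, and why resubmission trades time for reliability while replication trades cost for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] bounding an uncertain execution time (hypothetical type)."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        # Tasks run one after another: lower and upper bounds add.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def join_max(self, other: "Interval") -> "Interval":
        # Parallel branches that must both finish: take the max of each bound.
        return Interval(max(self.lo, other.lo), max(self.hi, other.hi))


def success_probability(p_fail: float, tries: int) -> float:
    """Probability that at least one of `tries` independent executions succeeds.

    Assumption: failures are independent with the same probability, so the
    value is the same whether the tries are replicas or resubmissions.
    """
    return 1.0 - p_fail ** tries


def resubmission_time(t: Interval, attempts: int) -> Interval:
    """Time bounds when a task is retried sequentially up to `attempts` times.

    Assumption: best case the first attempt succeeds; worst case every
    attempt runs for the full upper bound.
    """
    return Interval(t.lo, t.hi * attempts)


def replication_cost(unit_cost: float, replicas: int) -> float:
    """Cost of running `replicas` copies at once; finish time bounds stay unchanged."""
    return unit_cost * replicas


# Task a precedes two parallel tasks b and c.
a, b, c = Interval(4.0, 6.0), Interval(2.0, 5.0), Interval(3.0, 3.5)
makespan = a + b.join_max(c)
print(makespan)
print(success_probability(0.1, 2))
print(resubmission_time(b, 3), replication_cost(1.5, 2))
```

Under these assumptions both techniques reach the same success probability for a given number of tries, but resubmission stretches the upper time bound while replication multiplies the cost, which is the trade-off a combined scheme has to balance.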

Author information

Corresponding author

Correspondence to Hamid Reza Naji.

About this article

Cite this article

Matani, A., Naji, H.R. & Motallebi, H. A Fault-Tolerant Workflow Scheduling Algorithm for Grid with Near-Optimal Redundancy. J Grid Computing 18, 377–394 (2020). https://doi.org/10.1007/s10723-020-09522-2

  • DOI: https://doi.org/10.1007/s10723-020-09522-2
