Abstract
In scheduling workflows in a grid environment, concerns such as minimizing makespan and cost, meeting time and budget constraints, and coping with possible resource failures have motivated researchers to propose numerous scheduling algorithms. Several heuristic and meta-heuristic algorithms have been proposed to address these issues, but each of them often considers only one or a few of these criteria. Less attention has been paid to fault-tolerant scheduling of workflows. Adding fault tolerance to a workflow scheduling algorithm inevitably increases makespan and cost. The resubmission technique may cause an unacceptable increase in execution time and a possible deadline violation, while the replication technique increases execution cost. In this paper, we propose a fault-tolerant workflow scheduling algorithm with near-optimal time and cost overhead. The proposed approach offers a two-fold novelty. First, we assume a stochastic workflow model with nondeterministic task parameters and use interval arithmetic to model task execution times. We also propose a new scheduling algorithm in which task assignment decisions are made according to the performability fluctuations of the computational resources. Second, we employ an efficient combination of resubmission and replication techniques to gain the benefits of both. On this basis we propose an algorithm for reliable scheduling of scientific workflows with near-optimal additional time and cost. The proposed method achieves a significant increase in reliability while the additional execution time and cost are almost negligible.
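The trade-off described above can be made concrete with a small sketch. Interval arithmetic bounds each task's execution time. Resubmission and replication are then compared for the same number of tolerated failures. This is not the authors' algorithm. The names `Interval`, `Plan`, `resubmission`, `replication` and `chain` are hypothetical. The model assumes the following:

- failures are independent, each with the same probability `fail_p` per attempt;
- a failure is detected only at the end of a full-length run;
- replicas run concurrently on identical resources and are all paid for.

Under these assumptions, both strategies reach the same reliability. Resubmission stretches the upper time bound, while replication multiplies the cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] bounding an uncertain quantity (hypothetical helper)."""
    lo: float
    hi: float

    def __post_init__(self) -> None:
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    def __add__(self, other: "Interval") -> "Interval":
        # Interval addition: [a, b] + [c, d] = [a + c, b + d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, k: float) -> "Interval":
        # Multiplying by a non-negative scalar preserves the bound order.
        if k < 0:
            raise ValueError("scale factor must be non-negative")
        return Interval(self.lo * k, self.hi * k)


@dataclass(frozen=True)
class Plan:
    time: Interval      # completion-time bounds of one task
    cost: Interval      # monetary-cost bounds of one task
    reliability: float  # probability that at least one attempt succeeds


def resubmission(t: Interval, price: float, attempts: int, fail_p: float) -> Plan:
    # Attempts run one after another. Best case: the first attempt succeeds.
    # Worst successful case: all `attempts` runs take the full upper bound.
    if attempts < 1:
        raise ValueError("need at least one attempt")
    return Plan(
        time=Interval(t.lo, t.hi * attempts),
        cost=Interval(t.lo * price, t.hi * price * attempts),
        reliability=1 - fail_p ** attempts,
    )


def replication(t: Interval, price: float, replicas: int, fail_p: float) -> Plan:
    # Replicas run concurrently and every copy is paid for, so the time bounds
    # are unchanged but the cost grows with the number of copies.
    if replicas < 1:
        raise ValueError("need at least one replica")
    return Plan(
        time=t,
        cost=t.scale(price * replicas),
        reliability=1 - fail_p ** replicas,
    )


def chain(plans: list[Plan]) -> tuple[Interval, Interval]:
    """Total time and cost bounds of tasks executed strictly in sequence."""
    time, cost = Interval(0.0, 0.0), Interval(0.0, 0.0)
    for p in plans:
        time, cost = time + p.time, cost + p.cost
    return time, cost


if __name__ == "__main__":
    t = Interval(8.0, 12.0)  # hypothetical execution-time bounds, in seconds
    r = resubmission(t, price=1.0, attempts=2, fail_p=0.1)
    c = replication(t, price=1.0, replicas=2, fail_p=0.1)
    print("resubmission:", r)
    print("replication: ", c)
    print("both in sequence:", chain([r, c]))
```

In this toy setting, both plans tolerate one failure. Resubmission keeps the cost of a single copy but doubles the worst-case time, whereas replication keeps the time bounds but doubles the cost. The abstract's combined approach targets exactly this tension, and the sketch does not attempt to reproduce it.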
About this article
Cite this article
Matani, A., Naji, H.R. & Motallebi, H. A Fault-Tolerant Workflow Scheduling Algorithm for Grid with Near-Optimal Redundancy. J Grid Computing 18, 377–394 (2020). https://doi.org/10.1007/s10723-020-09522-2
DOI: https://doi.org/10.1007/s10723-020-09522-2