Abstract
We propose a fault-tolerant workflow scheduling algorithm that combines basic redundancy techniques and reduces execution time by minimizing redundancy overhead. A graph-theory-based divide-and-conquer approach selects the fault-tolerance strategy for each workflow task, based on the runtime situation and the position of the task in the graph. The main idea is to apportion resources among concurrently executing tasks so that more replicas are assigned to the tasks that benefit more from extra replicas. We use a concurrency graph to find the idle durations of resources, which are then used to process additional task replicas. We also propose an opportunistic method for executing extra task replicas when some resources become idle, and a new mapping order scheme for ordering task replicas on resources. The proposed approach achieves a significant performance improvement over existing approaches, especially when few resources are enrolled in order to reduce cost.
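The idea of giving more replicas to the tasks that benefit more can be illustrated with a small greedy sketch. This is not the algorithm of the paper, only a minimal illustration under stated assumptions: each replica of a task is assumed to fail independently with a known probability, and the "benefit" of one more replica is taken to be the resulting drop in the probability that all replicas of the task fail. The function name `apportion_replicas` and its parameters are hypothetical.

```python
import heapq


def apportion_replicas(fail_prob, num_resources):
    """Greedily split resources among concurrent tasks by marginal benefit.

    fail_prob: dict mapping task name -> probability that a single replica
        of that task fails (ASSUMPTION: replicas fail independently).
    num_resources: number of resources available to these tasks.
    Returns a dict mapping task name -> number of replicas.
    """
    if not fail_prob:
        return {}
    if num_resources < len(fail_prob):
        raise ValueError("need at least one resource per concurrent task")

    # Every task gets one replica first.
    replicas = {task: 1 for task in fail_prob}

    # With k replicas the task fails with probability p**k, so adding one
    # more replica reduces the failure probability by p**k * (1 - p).
    # heapq is a min-heap, so gains are stored negated.
    heap = [(-(p * (1 - p)), task) for task, p in fail_prob.items()]
    heapq.heapify(heap)

    # Hand out the remaining resources one by one to the task with the
    # largest marginal gain.
    for _ in range(num_resources - len(fail_prob)):
        _, task = heapq.heappop(heap)
        replicas[task] += 1
        p, k = fail_prob[task], replicas[task]
        heapq.heappush(heap, (-(p ** k * (1 - p)), task))

    return replicas


if __name__ == "__main__":
    # Two concurrent tasks, four resources: the unreliable task "a"
    # receives both spare resources.
    print(apportion_replicas({"a": 0.5, "b": 0.1}, 4))
```

A greedy allocation suffices in this sketch because the marginal gain of each additional replica shrinks as the replica count grows. A different benefit measure, such as expected completion time, would need its own gain function.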
Cite this article
Masoumi, M., Motallebi, H. A structure-aware algorithm for fault-tolerant scheduling of scientific workflows. J Supercomput 78, 17348–17377 (2022). https://doi.org/10.1007/s11227-022-04529-w
DOI: https://doi.org/10.1007/s11227-022-04529-w