A structure-aware algorithm for fault-tolerant scheduling of scientific workflows

Masoumi, Maryam; Motallebi, Hassan

doi:10.1007/s11227-022-04529-w

Published: 18 May 2022

Volume 78, pages 17348–17377 (2022)

The Journal of Supercomputing

Abstract

We propose a fault-tolerant workflow scheduling algorithm that combines basic redundancy techniques and reduces execution time by minimizing redundancy overhead. A graph-theory-based divide-and-conquer approach selects the fault-tolerance strategy for each workflow task according to the runtime situation and the position of the task in the graph. The main idea is to apportion resources among concurrently executing tasks so that tasks that benefit more from extra replicas receive more of them. We use the concept of a concurrency graph to find the idle durations of resources, which are then used to process additional task replicas. We also propose an opportunistic method for executing extra task replicas when some resources become idle, and a new mapping order scheme for ordering task replicas on resources. The proposed approach achieves a significant performance improvement over existing approaches, especially when few resources are enrolled in order to reduce cost.
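
The two ideas in the abstract, the concurrency graph and the apportioning of replicas among concurrent tasks, can be illustrated with a minimal sketch. It is not the algorithm from the article, whose details are not given here. It assumes that two tasks are concurrent when neither is reachable from the other in the workflow DAG, and that every replica fails independently with the same probability, so the benefit of one more replica is the task runtime times the drop in the probability that all replicas fail. The function names, the failure model, and the greedy rule are all hypothetical.

```python
from itertools import combinations


def descendants(dag):
    """Map every task to the set of tasks reachable from it (dag: task -> successors)."""
    reach = {}

    def visit(task):
        if task not in reach:
            reach[task] = set()
            for succ in dag.get(task, ()):
                reach[task].add(succ)
                reach[task] |= visit(succ)
        return reach[task]

    for task in dag:
        visit(task)
    return reach


def concurrency_graph(dag):
    """Assumed definition: an edge joins two tasks with no precedence path between them."""
    reach = descendants(dag)
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(reach), 2)
        if b not in reach[a] and a not in reach[b]
    }


def apportion_replicas(runtimes, n_resources, fail_prob):
    """Greedy sketch: one replica per task, then each spare resource goes to
    the task with the largest marginal gain from one more replica."""
    replicas = {task: 1 for task in runtimes}

    def gain(task):
        k = replicas[task]
        # Hypothetical benefit model: runtime weighted by the reduction in
        # the probability that all replicas of the task fail.
        return runtimes[task] * (fail_prob ** k - fail_prob ** (k + 1))

    for _ in range(n_resources - len(runtimes)):
        best = max(sorted(runtimes), key=gain)
        replicas[best] += 1
    return replicas


# Diamond-shaped workflow: A precedes B and C, which both precede D.
dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
concurrent = concurrency_graph(dag)
# B and C run concurrently; share 4 resources between them.
allocation = apportion_replicas({"B": 10.0, "C": 2.0}, 4, 0.5)
```

In this example only B and C are concurrent, and the longer task B receives both spare resources because losing it would cost more rework under the assumed model.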

Author information

Authors and Affiliations

Faculty of Electrical and Computer Engineering, Graduate University of Advanced Technology (GUAT), Kerman, Iran
Maryam Masoumi & Hassan Motallebi

Corresponding author

Correspondence to Hassan Motallebi.

About this article

Cite this article

Masoumi, M., Motallebi, H. A structure-aware algorithm for fault-tolerant scheduling of scientific workflows. J Supercomput 78, 17348–17377 (2022). https://doi.org/10.1007/s11227-022-04529-w

Accepted: 07 April 2022
Published: 18 May 2022
Issue Date: October 2022
DOI: https://doi.org/10.1007/s11227-022-04529-w
