A structure-aware algorithm for fault-tolerant scheduling of scientific workflows

The Journal of Supercomputing

Abstract

We propose a fault-tolerant workflow scheduling algorithm that combines basic redundancy techniques and reduces execution time by minimizing redundancy overhead. A graph-theory-based divide-and-conquer approach selects the fault-tolerance strategy for each workflow task according to the runtime situation and the position of the task in the graph. The main idea is to apportion resources among concurrently executing tasks so that more replicas are assigned to the tasks that benefit most from them. We use the concept of a concurrency graph to find the idle durations of resources, which are then used to process additional task replicas. We also propose an opportunistic method for executing extra replicas of tasks when some resources become idle, and a new mapping order scheme for ordering task replicas on resources. The proposed approach achieves a significant performance improvement over existing approaches, especially when few resources are enrolled in order to reduce cost.
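
To make the two central ideas of the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that two tasks are concurrent when neither depends on the other (directly or transitively) in the workflow graph, and it uses a hypothetical benefit model in which each replica of a task fails independently with probability p, so that the gain of a further replica is runtime × (p^k − p^(k+1)). The function names, the input layout and the benefit model are all illustrative assumptions; the preview does not state the paper's actual definitions.

```python
import heapq
from itertools import combinations


def reachable(dag, start):
    """Return all tasks reachable from `start` in a DAG given as {task: [successors]}."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in dag.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def concurrency_graph(dag):
    """Edges between task pairs with no dependency path in either direction (assumed definition)."""
    tasks = set(dag) | {s for succs in dag.values() for s in succs}
    reach = {t: reachable(dag, t) for t in tasks}
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(tasks), 2)
        if b not in reach[a] and a not in reach[b]
    }


def allocate_replicas(tasks, total_slots):
    """Greedily hand out replica slots by marginal gain.

    tasks: {name: (runtime, fail_prob)} -- hypothetical inputs.
    Every task gets one replica; each remaining slot goes to the task
    whose next replica has the largest assumed gain.
    """
    if total_slots < len(tasks):
        raise ValueError("need at least one slot per task")
    replicas = {name: 1 for name in tasks}
    if not tasks:
        return replicas

    def gain(name):
        runtime, p = tasks[name]
        k = replicas[name]
        # Assumed benefit: runtime-weighted drop in the chance that all replicas fail.
        return runtime * (p ** k - p ** (k + 1))

    heap = [(-gain(name), name) for name in tasks]
    heapq.heapify(heap)
    for _ in range(total_slots - len(tasks)):
        _, name = heapq.heappop(heap)
        replicas[name] += 1
        heapq.heappush(heap, (-gain(name), name))
    return replicas


if __name__ == "__main__":
    workflow = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
    print(concurrency_graph(workflow))
    print(allocate_replicas({"B": (10.0, 0.2), "C": (4.0, 0.2)}, total_slots=5))
```

In this toy diamond-shaped workflow only B and C can run at the same time, so they compete for the same slots; under the assumed model the longer task B ends up with more replicas than C. A priority queue keeps each allocation step logarithmic in the number of concurrent tasks, which is one reasonable design choice among several.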

Author information

Corresponding author

Correspondence to Hassan Motallebi.

About this article

Cite this article

Masoumi, M., Motallebi, H. A structure-aware algorithm for fault-tolerant scheduling of scientific workflows. J Supercomput 78, 17348–17377 (2022). https://doi.org/10.1007/s11227-022-04529-w

  • DOI: https://doi.org/10.1007/s11227-022-04529-w
