A Framework for Adaptive Fault-Tolerant Execution of Workflows in the Grid: Empirical and Theoretical Analysis

Journal of Grid Computing

Abstract

In this paper, we propose and evaluate a framework for fault-tolerant workflow execution in Grid environments. Unlike previous work in the literature, our system dynamically chooses an appropriate fault tolerance technique using a user-defined rule-based system. We also provide a generic interface that can be used to add fault tolerance techniques to the framework. The results obtained with real workflows in an experimental Grid environment show that the overhead introduced by our framework in a failure-free execution is, in the worst evaluated case, approximately 10%. Moreover, we show that, using our framework, workflows are able to execute successfully in the presence of failures and that the framework can dynamically choose an appropriate fault tolerance technique. The main contributions of our work are twofold: the developed framework and the model-based dependability analysis we performed on it. The purpose of the model-based dependability analysis is to evaluate the interaction between our framework and the distributed Grid environment beyond the physical limitations of an empirical evaluation. By doing this, we provide a means to plan the assurance of QoS in Grid resource allocation, while applying the fault-tolerance mechanisms implemented in our framework regardless of the underlying middleware.
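To make the two ideas in the abstract more concrete (rule-based selection of a technique and a generic interface for adding techniques), the following is a minimal illustrative sketch. It is not the framework's actual API: every name here (`TaskContext`, `FaultToleranceTechnique`, `Retry`, `Replication`, `Rule`, `choose_technique`, `flaky`) and every threshold is a hypothetical assumption, and replication is simulated sequentially rather than on separate Grid resources.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskContext:
    """Hypothetical runtime facts that a user-defined rule might inspect."""
    failures_so_far: int
    estimated_runtime_s: float
    idle_resources: int


class FaultToleranceTechnique(ABC):
    """Hypothetical generic interface; a new technique implements run()."""
    name: str

    @abstractmethod
    def run(self, task: Callable[[], str]) -> str:
        ...


class Retry(FaultToleranceTechnique):
    name = "retry"

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts

    def run(self, task: Callable[[], str]) -> str:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                return task()
            except RuntimeError as err:
                last_error = err
        raise RuntimeError(f"failed after {self.max_attempts} attempts") from last_error


class Replication(FaultToleranceTechnique):
    name = "replication"

    def __init__(self, replicas: int = 2) -> None:
        self.replicas = replicas

    def run(self, task: Callable[[], str]) -> str:
        # Sequential stand-in for parallel replicas: the first copy that succeeds wins.
        for _ in range(self.replicas):
            try:
                return task()
            except RuntimeError:
                continue
        raise RuntimeError(f"all {self.replicas} replicas failed")


@dataclass
class Rule:
    """A user-defined rule: if condition(ctx) holds, use the given technique."""
    condition: Callable[[TaskContext], bool]
    technique: FaultToleranceTechnique


def choose_technique(rules: List[Rule], ctx: TaskContext,
                     default: FaultToleranceTechnique) -> FaultToleranceTechnique:
    """Return the technique of the first matching rule, else the default."""
    for rule in rules:
        if rule.condition(ctx):
            return rule.technique
    return default


def flaky(fail_times: int) -> Callable[[], str]:
    """Build a task that fails fail_times times and then succeeds."""
    remaining = [fail_times]

    def task() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated resource failure")
        return "done"

    return task


# Assumed example rules; the thresholds are arbitrary illustrations.
RULES = [
    Rule(lambda c: c.estimated_runtime_s > 3600 and c.idle_resources >= 2, Replication(2)),
    Rule(lambda c: c.failures_so_far >= 1, Retry(5)),
]
DEFAULT = Retry(3)

if __name__ == "__main__":
    ctx = TaskContext(failures_so_far=0, estimated_runtime_s=7200.0, idle_resources=4)
    technique = choose_technique(RULES, ctx, DEFAULT)
    print(technique.name, technique.run(flaky(1)))
```

In this sketch, adding a technique means writing one more subclass of the interface and referencing it from a rule; the selection logic itself does not change.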

Author information

Corresponding author

Correspondence to Daniel Macedo Batista.

About this article

Cite this article

Guimaraes, F.P., Célestin, P., Batista, D.M. et al. A Framework for Adaptive Fault-Tolerant Execution of Workflows in the Grid: Empirical and Theoretical Analysis. J Grid Computing 12, 127–151 (2014). https://doi.org/10.1007/s10723-013-9281-4

  • DOI: https://doi.org/10.1007/s10723-013-9281-4
