A Flexible Framework for Fault Tolerance in the Grid

Journal of Grid Computing

Abstract

This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism for the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is to detect those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome this challenge with a notification mechanism based on the interpretation of notification messages delivered from the underlying Grid resources. The Grid-WFS, built on top of the FDS, allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve that flexibility by using a workflow structure as a high-level recovery policy specification, which enables support for multiple failure recovery techniques, the separation of failure handling strategies from the application code, and user-defined exception handling. Finally, this paper presents an experimental evaluation of the Grid-WFS using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid applications to achieve high performance in the presence of failures.
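
The separation of failure handling strategies from application code can be illustrated with a minimal Python sketch. This is not the Grid-WFS interface, which the abstract does not describe. Every name below (`TaskCrash`, `UserDefinedException`, `RecoveryPolicy`, `run_with_policy`, `make_flaky`) is hypothetical, and the two recovery techniques shown, retrying and falling back to an alternative task, are illustrative choices rather than a list taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class TaskCrash(Exception):
    """Hypothetical stand-in for a task crash reported by a failure detector."""


class UserDefinedException(Exception):
    """Hypothetical stand-in for an application-specific failure."""

    def __init__(self, name: str):
        super().__init__(name)
        self.name = name


@dataclass
class RecoveryPolicy:
    """Recovery policy for one task, declared apart from the task's own code."""
    max_tries: int = 1                                # attempts allowed on crash
    alternative: Optional[Callable[[], str]] = None   # fallback task, if any
    handlers: Dict[str, Callable[[], str]] = field(default_factory=dict)


def run_with_policy(task: Callable[[], str], policy: RecoveryPolicy) -> str:
    """Run a task and apply the policy whenever a failure is detected."""
    for _ in range(policy.max_tries):
        try:
            return task()
        except UserDefinedException as exc:
            # A named exception is routed to its user-supplied handler.
            handler = policy.handlers.get(exc.name)
            if handler is not None:
                return handler()
            break  # no handler registered: stop retrying, try the alternative
        except TaskCrash:
            continue  # crash: retry while attempts remain
    if policy.alternative is not None:
        return policy.alternative()
    raise RuntimeError("task failed and no recovery option remained")


def make_flaky(crashes: int) -> Callable[[], str]:
    """Build a task that crashes a fixed number of times before succeeding."""
    state = {"left": crashes}

    def task() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise TaskCrash("simulated crash")
        return "fast result"

    return task
```

In this sketch the task function contains no recovery logic. Changing from retry to an alternative task, or registering a handler for a named exception, only requires a different `RecoveryPolicy` value, which is the kind of separation the abstract attributes to a high-level policy specification.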


About this article

Cite this article

Hwang, S., Kesselman, C. A Flexible Framework for Fault Tolerance in the Grid. Journal of Grid Computing 1, 251–272 (2003). https://doi.org/10.1023/B:GRID.0000035187.54694.75

  • DOI: https://doi.org/10.1023/B:GRID.0000035187.54694.75
