
An adaptive task-level fault-tolerant approach to Grid

The Journal of Supercomputing

Abstract

A robust failure recovery mechanism that handles diverse failures in a heterogeneous and dynamic Grid is essential to ensure that long-running applications execute to completion. Although various efforts have been made to address this issue, existing solutions either employ a single fault-tolerant technique without considering the diversity of failures, or propose frameworks that cannot deal adaptively with the various kinds of failures in a Grid. In this paper, an adaptive task-level fault-tolerant approach to Grid is proposed. The approach aims to handle a broad set of failures arising in a Grid environment by integrating basic fault-tolerant approaches. Moreover, this paper argues that resource consumption, which has not received enough attention, is also an important evaluation metric for any fault-tolerant approach. Corresponding evaluation models based on mean execution time and resource consumption are constructed to evaluate any fault-tolerant approach. Based on these models, we demonstrate the effectiveness of our approach and illustrate the performance gains achieved via simulations. Experiments on a real Grid have also been carried out, and the results show that our approach achieves better performance and consumes fewer resources.



Author information

Corresponding author

Correspondence to Yongwei Wu.

About this article

Cite this article

Wu, Y., Yuan, Y., Yang, G. et al. An adaptive task-level fault-tolerant approach to Grid. J Supercomput 51, 97–114 (2010). https://doi.org/10.1007/s11227-009-0276-7

  • DOI: https://doi.org/10.1007/s11227-009-0276-7
