Abstract
A robust failure recovery mechanism that handles diverse failures in heterogeneous, dynamic Grid environments is essential to ensure the complete execution of long-running applications. Although various efforts have been made to address this issue, existing solutions either employ a single fault-tolerant technique without considering the diversity of failures, or propose frameworks that cannot adaptively deal with the various kinds of failures that occur in a Grid. In this paper, an adaptive task-level fault-tolerant approach to Grid is proposed. The approach aims to handle a fairly complete set of failures arising in the Grid environment by integrating basic fault-tolerant approaches. Moreover, this paper argues that resource consumption, which has not received enough attention, is also an important evaluation metric for any fault-tolerant approach. Corresponding evaluation models based on mean execution time and resource consumption are constructed to evaluate any fault-tolerant approach. Based on these models, we demonstrate the effectiveness of our approach and illustrate the performance gains achieved via simulations. Experiments on a real Grid have also been conducted, and the results show that our approach achieves better performance and consumes fewer resources.
Cite this article
Wu, Y., Yuan, Y., Yang, G. et al. An adaptive task-level fault-tolerant approach to Grid. J Supercomput 51, 97–114 (2010). https://doi.org/10.1007/s11227-009-0276-7
DOI: https://doi.org/10.1007/s11227-009-0276-7