An adaptive task-level fault-tolerant approach to Grid

Wu, Yongwei; Yuan, Yulai; Yang, Guangwen; Zheng, Weimin

doi:10.1007/s11227-009-0276-7

Published: 14 March 2009

Volume 51, pages 97–114, (2010)

The Journal of Supercomputing

Abstract

A robust failure recovery mechanism that handles diverse failures in heterogeneous, dynamic Grid environments is essential to ensure that long-running applications execute to completion. Although various efforts have addressed this issue, existing solutions either employ a single fault-tolerant technique without considering the diversity of failures, or propose frameworks that cannot adapt to the various kinds of failures that occur in a Grid. This paper proposes an adaptive task-level fault-tolerant approach to Grid. The approach aims to handle a fairly complete set of the failures that arise in Grid environments by integrating basic fault-tolerant approaches. The paper also argues that resource consumption, which has not received enough attention, is an important evaluation metric for any fault-tolerant approach. Corresponding evaluation models based on mean execution time and resource consumption are constructed to evaluate any fault-tolerant approach. Based on these models, we demonstrate the effectiveness of our approach and illustrate the performance gains achieved via simulations. Experiments on a real Grid show that our approach achieves better performance and consumes fewer resources.
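To make the two evaluation metrics concrete, the following is a minimal illustrative sketch and not the model constructed in the paper. It assumes failures arrive as a Poisson process, compares three basic task-level techniques (retry, checkpointing, replication), and reports mean execution time alongside resource consumption, taken here as total processor time. All function names, parameter values, and the cost accounting for replicas are hypothetical choices made for this example.

```python
import random


def retry(work, rate, rng):
    """Restart the task from scratch after every failure (assumed model)."""
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(rate)
        if time_to_failure >= work:
            elapsed += work
            # One processor is busy for the whole run.
            return elapsed, elapsed
        elapsed += time_to_failure


def checkpoint(work, rate, rng, interval=2.0, overhead=0.1):
    """Save state every `interval` units of work; a failure loses one segment."""
    elapsed, done = 0.0, 0.0
    while done < work:
        segment = min(interval, work - done)
        cost = segment + overhead  # useful work plus the cost of saving state
        time_to_failure = rng.expovariate(rate)
        if time_to_failure >= cost:
            elapsed += cost
            done += segment
        else:
            elapsed += time_to_failure  # segment lost, resume from last checkpoint
    return elapsed, elapsed


def replication(work, rate, rng, replicas=3):
    """Run independent copies; the first copy to finish wins."""
    finish = min(retry(work, rate, rng)[0] for _ in range(replicas))
    # Assumption: every replica occupies a processor until the winner finishes.
    return finish, replicas * finish


def evaluate(strategy, work=10.0, rate=0.2, trials=2000, seed=0):
    """Estimate (mean execution time, mean resource consumption) by simulation."""
    rng = random.Random(seed)
    total_time = total_resource = 0.0
    for _ in range(trials):
        time_used, resource_used = strategy(work, rate, rng)
        total_time += time_used
        total_resource += resource_used
    return total_time / trials, total_resource / trials


if __name__ == "__main__":
    for strategy in (retry, checkpoint, replication):
        mean_time, mean_resource = evaluate(strategy)
        print(f"{strategy.__name__:12s} time={mean_time:7.2f} resource={mean_resource:7.2f}")
```

Under these assumptions, replication shortens execution time relative to plain retry but multiplies the processor time spent, which is the kind of trade-off a resource consumption metric is meant to expose.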

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, People’s Republic of China
Yongwei Wu, Yulai Yuan, Guangwen Yang & Weimin Zheng

Corresponding author

Correspondence to Yongwei Wu.

About this article

Cite this article

Wu, Y., Yuan, Y., Yang, G. et al. An adaptive task-level fault-tolerant approach to Grid. J Supercomput 51, 97–114 (2010). https://doi.org/10.1007/s11227-009-0276-7

Received: 15 December 2008
Accepted: 18 February 2009
Published: 14 March 2009
Issue Date: February 2010
DOI: https://doi.org/10.1007/s11227-009-0276-7
