Abstract
This paper addresses fault tolerance in grid computing and proposes a novel adaptive, task replication based, fault tolerant job scheduling strategy for economy driven grids. The proposed strategy maintains a fault history for each resource, termed the resource fault index. The fault index entry for a resource is updated whenever that grid resource successfully completes or fails an assigned task. The Grid Resource Broker then replicates the task (that is, submits the same task to different backup resources) with an intensity that depends on the vulnerability of the resource to faults, as indicated by its fault index. Consequently, if a fault occurs at a resource, the results of the replicated task(s) on the backup resource(s) can be used. Hence, user job(s) can be completed within the specified deadline and assigned budget, even in the event of faults at the grid resource(s).
The performance of the proposed strategy is evaluated through extensive simulations and compared with the Time Optimization Strategy and the Checkpointing based Strategy in an economy driven grid environment. The experimental results demonstrate that, in the presence of faults, the proposed fault tolerant strategy increases the number of tasks completed both with varied deadline and fixed budget and with varied budget and fixed deadline. Additionally, the proposed strategy uses a smaller percentage of the deadline time than both the Time Optimization Strategy and the Checkpointing based Strategy. Although the proposed strategy spends a greater percentage of the budget than either of those strategies, this is acceptable in time optimization, where the main objective is to maximize the number of tasks completed within a given deadline. The experiments show that the proposed strategy better satisfies user QoS requirements: it schedules tasks effectively and tolerates faults gracefully, at the cost of slightly higher budget consumption. Hence, the proposed fault tolerant strategy helps sustain users’ faith in the grid by enabling it to deliver reliable and consistent performance in the presence of faults.
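The mechanism outlined in the abstract, a per-resource fault index that drives how many replicas of a task are submitted, can be illustrated with a minimal Python sketch. The abstract does not give the update rule or the mapping from fault index to replication intensity, so everything concrete here is an assumption made for illustration: the class `FaultIndexTable`, the function `plan_submission`, the +1/-1 update rule, the cap `max_replicas`, and the policy of preferring backups with the lowest fault index are all hypothetical. Deadline and budget constraints are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class FaultIndexTable:
    """Per-resource fault history (hypothetical structure, not from the paper)."""
    max_replicas: int = 3  # assumed cap on backup copies per task
    index: dict = field(default_factory=dict)  # resource name -> fault index

    def record(self, resource: str, succeeded: bool) -> None:
        # Assumed update rule: +1 on failure, -1 on success, never below 0.
        current = self.index.get(resource, 0)
        self.index[resource] = max(0, current - 1) if succeeded else current + 1

    def replica_count(self, resource: str) -> int:
        # Assumed mapping: one backup per recorded fault, up to the cap.
        return min(self.index.get(resource, 0), self.max_replicas)


def plan_submission(table: FaultIndexTable, primary: str, candidates: list) -> list:
    """Return the primary resource followed by the backups that get replicas."""
    n = table.replica_count(primary)
    # Assumed selection policy: prefer backups with the lowest fault index.
    backups = sorted(
        (r for r in candidates if r != primary),
        key=lambda r: table.index.get(r, 0),
    )[:n]
    return [primary, *backups]


if __name__ == "__main__":
    table = FaultIndexTable()
    table.record("R1", succeeded=False)
    table.record("R1", succeeded=False)
    table.record("R2", succeeded=False)
    # R1 has failed twice, so its task is replicated on two backups.
    print(plan_submission(table, "R1", ["R1", "R2", "R3", "R4"]))
```

In this sketch a resource with a clean history receives no replicas, which keeps the extra budget spent on replication proportional to observed unreliability; the actual strategy in the paper may weigh this trade-off differently.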
Cite this article
Nazir, B., Qureshi, K. & Manuel, P. Replication based fault tolerant job scheduling strategy for economy driven grid. J Supercomput 62, 855–873 (2012). https://doi.org/10.1007/s11227-012-0756-z
DOI: https://doi.org/10.1007/s11227-012-0756-z