Replication based fault tolerant job scheduling strategy for economy driven grid

Nazir, Babar; Qureshi, Kalim; Manuel, Paul

doi:10.1007/s11227-012-0756-z

Published: 12 April 2012

Volume 62, pages 855–873, (2012)

The Journal of Supercomputing

Babar Nazir¹,
Kalim Qureshi² &
Paul Manuel²

Abstract

This paper addresses the problem of fault tolerance in grid computing and proposes a novel adaptive, task replication based fault tolerant job scheduling strategy for an economy driven grid. The proposed strategy maintains a fault history for each resource, termed the resource fault index. The fault index entry for a resource is updated according to whether the grid resource successfully completes or fails an assigned task. The Grid Resource Broker then replicates the task (submitting the same task to different backup resources) with an intensity that depends on the resource's vulnerability to faults, as indicated by the resource fault index. Consequently, if a fault occurs at a resource, the results of the replicated task(s) on other backup resource(s) can be used. Hence, user job(s) can be completed within the specified deadline and assigned budget, even in the event of faults at the grid resource(s).
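
The bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not state the update rule or how the fault index maps to replication intensity, so the rule used here (add 1 on failure, subtract 1 on success, never below 0), the cap of three backups, and the names `FaultIndex`, `record`, `replicas_for`, and `dispatch` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FaultIndex:
    """Per-resource fault history (the 'resource fault index')."""
    index: dict = field(default_factory=dict)

    def record(self, resource: str, succeeded: bool) -> None:
        # Assumed update rule: +1 on failure, -1 on success, floored at 0.
        current = self.index.get(resource, 0)
        if succeeded:
            self.index[resource] = max(0, current - 1)
        else:
            self.index[resource] = current + 1

    def replicas_for(self, resource: str, max_replicas: int = 3) -> int:
        # Assumed mapping: one backup copy per unit of fault index, capped.
        return min(self.index.get(resource, 0), max_replicas)


def dispatch(primary: str, backups: list, faults: FaultIndex) -> list:
    """Return the resources a task is submitted to: primary plus backups."""
    n = faults.replicas_for(primary)
    return [primary] + backups[:n]


faults = FaultIndex()
faults.record("R1", succeeded=False)  # R1 failed a task
faults.record("R1", succeeded=False)  # R1 failed again
faults.record("R2", succeeded=True)   # R2 completed its task

print(dispatch("R1", ["R2", "R3", "R4"], faults))  # fault-prone: two backups
print(dispatch("R2", ["R3", "R4"], faults))        # reliable: no backups
```

In this sketch a resource with a clean history receives no replicas, which keeps budget consumption low, while a resource that has failed repeatedly has its tasks duplicated on backups so that a result is still available if it fails again.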

Through extensive simulations, the performance of the proposed strategy is evaluated and compared with the Time Optimization Strategy and the Checkpointing based Strategy in an economy driven grid environment. The experimental results demonstrate that, in the presence of faults, the proposed fault tolerant strategy increases the number of tasks completed with varied deadline and fixed budget, as well as the number of tasks completed with varied budget and fixed deadline. Additionally, the proposed strategy uses a smaller percentage of the deadline time than both the Time Optimization Strategy and the Checkpointing based Strategy. Although the proposed strategy spends a greater percentage of the budget than the Time Optimization Strategy and the Checkpointing based Strategy, this is acceptable for a time optimization strategy, whose main objective is to maximize the number of tasks completed within a given deadline. It can be concluded from the experiments that the proposed strategy improves satisfaction of the user QoS requirements. It schedules tasks effectively and tolerates faults gracefully, at the cost of slightly higher budget consumption. Hence, the proposed fault tolerant strategy helps sustain the user’s faith in the grid by enabling the grid to deliver reliable and consistent performance in the presence of faults.

Author information

Authors and Affiliations

Department of Computer Science, COMSATS Institute of Information Technology, 22060, Tobe Camp., Abbottabad, Pakistan
Babar Nazir
Department of Information Science, Kuwait University, Safat, 13060, Kuwait
Kalim Qureshi & Paul Manuel

Corresponding author

Correspondence to Babar Nazir.

About this article

Cite this article

Nazir, B., Qureshi, K. & Manuel, P. Replication based fault tolerant job scheduling strategy for economy driven grid. J Supercomput 62, 855–873 (2012). https://doi.org/10.1007/s11227-012-0756-z

Published: 12 April 2012
Issue Date: November 2012
DOI: https://doi.org/10.1007/s11227-012-0756-z
