Replication based fault tolerant job scheduling strategy for economy driven grid

The Journal of Supercomputing

Abstract

This paper addresses the problem of fault tolerance in grid computing and proposes a novel adaptive, task-replication-based, fault-tolerant job scheduling strategy for economy-driven grids. The proposed strategy maintains a fault history for each resource, termed the resource fault index. The fault index entry for a resource is updated whenever that resource successfully completes or fails an assigned task. The Grid Resource Broker then replicates the task (that is, submits the same task to different backup resources) with an intensity that depends on the resource’s vulnerability to faults, as indicated by its fault index. Consequently, if a fault occurs at a resource, the results of the replicated task(s) on the backup resource(s) can be used. Hence, user jobs can be completed within the specified deadline and assigned budget, even in the event of faults at the grid resource(s).
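The mechanism described above can be illustrated with a minimal sketch. The abstract does not state the exact update rule or how the fault index maps to replication intensity, so the sketch assumes a simple rule (increment on failure, decrement on success, never below zero) and one backup replica per fault-index point up to a cap. The class `FaultIndexBroker`, its methods, and the `max_replicas` parameter are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FaultIndexBroker:
    """Toy broker tracking a per-resource fault index (all rules are assumptions)."""

    max_replicas: int = 3  # hypothetical cap on the number of backup copies
    fault_index: dict = field(default_factory=dict)

    def record_outcome(self, resource: str, succeeded: bool) -> None:
        # Assumed update rule: +1 on failure, -1 on success, never below 0.
        current = self.fault_index.get(resource, 0)
        self.fault_index[resource] = max(0, current - 1) if succeeded else current + 1

    def replica_count(self, resource: str) -> int:
        # Assumed mapping: one backup replica per fault-index point, capped.
        return min(self.fault_index.get(resource, 0), self.max_replicas)

    def targets_for(self, primary: str, backups: list) -> list:
        # Resources that receive a copy of the task: the primary plus as many
        # backup resources as the primary's fault index suggests.
        n = self.replica_count(primary)
        return [primary] + [b for b in backups if b != primary][:n]


broker = FaultIndexBroker()
broker.record_outcome("R1", succeeded=False)  # R1 fails twice
broker.record_outcome("R1", succeeded=False)
broker.record_outcome("R2", succeeded=True)   # R2 completes its task

print(broker.targets_for("R1", ["R2", "R3", "R4"]))  # fault-prone: replicated
print(broker.targets_for("R2", ["R1", "R3"]))        # reliable: no replicas
```

Under these assumptions, a task assigned to the fault-prone resource `R1` is also submitted to two backup resources, while a task assigned to `R2` runs without replication. A real broker would additionally have to check that the replicas fit within the user's deadline and budget.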

The performance of the proposed strategy is evaluated through extensive simulations and compared with the Time Optimization Strategy and the Checkpointing based Strategy in an economy-driven grid environment. The experimental results demonstrate that, in the presence of faults, the proposed fault-tolerant strategy increases the number of tasks completed both with a varied deadline and fixed budget and with a varied budget and fixed deadline. Additionally, the proposed strategy uses a smaller percentage of the deadline time than both the Time Optimization Strategy and the Checkpointing based Strategy. Although the proposed strategy spends a greater percentage of the budget than either of these strategies, this is acceptable for a time optimization strategy, whose main objective is to maximize the number of tasks completed within a given deadline. The experiments indicate that the proposed strategy better satisfies user QoS requirements: it schedules tasks effectively and tolerates faults gracefully, at a slightly higher cost in budget consumption. Hence, the proposed fault-tolerant strategy helps sustain users’ faith in the grid by enabling it to deliver reliable and consistent performance in the presence of faults.

Author information

Corresponding author

Correspondence to Babar Nazir.

About this article

Cite this article

Nazir, B., Qureshi, K. & Manuel, P. Replication based fault tolerant job scheduling strategy for economy driven grid. J Supercomput 62, 855–873 (2012). https://doi.org/10.1007/s11227-012-0756-z

  • DOI: https://doi.org/10.1007/s11227-012-0756-z
