A reliable checkpoint storage strategy for grid

Computing

Abstract

Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time. This makes the grid infrastructure inherently unreliable, resulting in delays and failures of executing jobs. Fault tolerance is therefore vital for achieving reliability, availability and quality of service in a grid. The most common fault tolerance technique in High Performance Computing is rollback recovery, which relies on the availability of checkpoints and the stability of storage media. Checkpoints are therefore replicated on storage media. If replication is not done properly, it increases job execution time. Furthermore, dedicating powerful resources solely to checkpoint storage wastes their computation power and may result in bottlenecks when the load on the network is high. To address these problems, this paper proposes a checkpoint-replication-based fault tolerance strategy named Reliable Checkpoint Storage Strategy (RCSS). In RCSS, checkpoints are replicated on all checkpoint servers in the grid in a distributed manner, which decreases checkpoint replication time and in turn improves overall job execution time. Additionally, if a resource fails during execution of a job, RCSS restarts the job from its last valid checkpoint, taken from any checkpoint server in the grid. Furthermore, to increase grid performance, the CPU cycles of checkpoint servers are also utilized during high load on the network. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5 % with a varying number of checkpoint servers. RCSS also reduces checkpoint wave completion time by 50 % with a varying number of clusters. Additionally, RCSS reduces replication time within a cluster by 39.5 %.
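
The replicate-then-recover idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation and does not use GridSim; the names `Checkpoint`, `CheckpointServer`, `replicate` and `recover` are hypothetical. Checking validity with a SHA-256 digest and modelling failure with an `alive` flag are assumptions made purely for illustration. The sketch stores a copy of each checkpoint on every reachable server and, on recovery, returns the most recent copy that is still valid on any live server.

```python
import hashlib
from dataclasses import dataclass, field, replace


@dataclass
class Checkpoint:
    job_id: str
    sequence: int   # higher means more recent
    data: bytes
    digest: str     # SHA-256 of data (assumed validity check, not from the paper)

    def is_valid(self) -> bool:
        return hashlib.sha256(self.data).hexdigest() == self.digest


@dataclass
class CheckpointServer:
    name: str
    alive: bool = True                          # hypothetical failure flag
    store: dict = field(default_factory=dict)   # job_id -> list of Checkpoint

    def save(self, cp: Checkpoint) -> bool:
        if not self.alive:
            return False
        # Keep an independent copy so corruption on one server stays local.
        self.store.setdefault(cp.job_id, []).append(replace(cp))
        return True

    def latest_valid(self, job_id: str):
        if not self.alive:
            return None
        valid = [cp for cp in self.store.get(job_id, []) if cp.is_valid()]
        return max(valid, key=lambda cp: cp.sequence, default=None)


def replicate(servers, job_id, sequence, data):
    """Store the checkpoint on every reachable server; return the copy count."""
    cp = Checkpoint(job_id, sequence, data, hashlib.sha256(data).hexdigest())
    return sum(server.save(cp) for server in servers)


def recover(servers, job_id):
    """Return the most recent valid checkpoint held by any live server."""
    found = [server.latest_valid(job_id) for server in servers]
    found = [cp for cp in found if cp is not None]
    return max(found, key=lambda cp: cp.sequence, default=None)


if __name__ == "__main__":
    grid = [CheckpointServer("cs1"), CheckpointServer("cs2"), CheckpointServer("cs3")]
    replicate(grid, "job-42", 1, b"state after step 1")
    grid[0].alive = False                       # one server leaves the grid
    replicate(grid, "job-42", 2, b"state after step 2")
    restart_point = recover(grid, "job-42")
    print(restart_point.sequence, restart_point.data)
```

Because every live server holds a full copy, recovery succeeds as long as one server with a valid checkpoint remains reachable. The sketch ignores the cost of replication and the reuse of checkpoint servers' CPU cycles, both of which the paper evaluates.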

Author information

Corresponding author

Correspondence to Babar Nazir.

About this article

Cite this article

Malik, S., Nazir, B., Qureshi, K. et al. A reliable checkpoint storage strategy for grid. Computing 95, 611–632 (2013). https://doi.org/10.1007/s00607-012-0250-8

  • DOI: https://doi.org/10.1007/s00607-012-0250-8
