A reliable checkpoint storage strategy for grid

Malik, Sana; Nazir, Babar; Qureshi, Kalim; Khan, Imran Ali

doi:10.1007/s00607-012-0250-8

Published: 14 December 2012

Computing, Volume 95, pages 611–632 (2013)

Abstract

Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time, which makes the grid infrastructure inherently unreliable and leads to delays and failures of executing jobs. Fault tolerance is therefore vital for achieving reliability, availability, and quality of service in a grid. The most common fault tolerance technique in high performance computing is rollback recovery, which relies on the availability of checkpoints and the stability of storage media. Checkpoints are therefore replicated on storage media. If replication is not done properly, however, it increases job execution time. Furthermore, dedicating powerful resources solely to checkpoint storage wastes their computation power and may create bottlenecks when the network load is high. To address these problems, this paper proposes a checkpoint-replication-based fault tolerance strategy named Reliable Checkpoint Storage Strategy (RCSS). In RCSS, checkpoints are replicated on all checkpoint servers in the grid in a distributed manner, which decreases checkpoint replication time and in turn improves overall job execution time. Additionally, if a resource fails during the execution of a job, RCSS restarts the job from its last valid checkpoint, taken from any checkpoint server in the grid. Furthermore, to increase grid performance, the CPU cycles of checkpoint servers are also utilized when the network load is high. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5 % with a varying number of checkpoint servers. RCSS also reduces checkpoint wave completion time by 50 % with a varying number of clusters, and reduces replication time within a cluster by 39.5 %.
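The recovery idea described in the abstract (replicate each checkpoint to every checkpoint server, then restart from the last valid copy found on any of them) can be illustrated with a minimal Python sketch. This is not the authors' RCSS algorithm or its GridSim implementation, neither of which is given on this page. The in-memory `CheckpointServer`, the SHA-256 validity check, and all names below are assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass, field, replace


@dataclass
class Checkpoint:
    job_id: str
    seq: int       # checkpoint sequence number within the job
    state: bytes   # serialized job state
    digest: str    # SHA-256 of state, used here to detect corrupt copies

    def is_valid(self) -> bool:
        return hashlib.sha256(self.state).hexdigest() == self.digest


@dataclass
class CheckpointServer:
    # Hypothetical in-memory stand-in for a checkpoint server in the grid.
    name: str
    alive: bool = True
    store: dict = field(default_factory=dict)  # (job_id, seq) -> Checkpoint

    def put(self, cp: Checkpoint) -> bool:
        if not self.alive:
            return False
        # Keep an independent copy so each server holds its own replica.
        self.store[(cp.job_id, cp.seq)] = replace(cp)
        return True


def make_checkpoint(job_id: str, seq: int, state: bytes) -> Checkpoint:
    return Checkpoint(job_id, seq, state, hashlib.sha256(state).hexdigest())


def replicate(cp: Checkpoint, servers: list) -> list:
    """Store cp on every reachable server; return the names that accepted it."""
    return [s.name for s in servers if s.put(cp)]


def last_valid_checkpoint(job_id: str, servers: list):
    """Return the newest valid checkpoint for job_id on any live server, or None."""
    best = None
    for s in servers:
        if not s.alive:
            continue
        for (jid, seq), cp in s.store.items():
            if jid == job_id and cp.is_valid() and (best is None or seq > best.seq):
                best = cp
    return best
```

In this sketch a job can be restarted as long as at least one live server holds an uncorrupted replica; if the newest copy is missing or corrupt everywhere, recovery falls back to an older checkpoint. How RCSS itself orders, schedules, or validates replicas is described only in the full paper.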

Author information

Authors and Affiliations

Department of Computer Science, COMSATS Institute of IT, Abbottabad, Pakistan
Sana Malik, Babar Nazir & Imran Ali Khan
Department of Computer Science, Kuwait University, Kuwait City, Kuwait
Kalim Qureshi

Corresponding author

Correspondence to Babar Nazir.

About this article

Cite this article

Malik, S., Nazir, B., Qureshi, K. et al. A reliable checkpoint storage strategy for grid. Computing 95, 611–632 (2013). https://doi.org/10.1007/s00607-012-0250-8

Received: 20 July 2012
Accepted: 28 November 2012
Published: 14 December 2012
Issue Date: July 2013
DOI: https://doi.org/10.1007/s00607-012-0250-8

Mathematics Subject Classification

65C99
