An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems

Goswami, D.; Sahu, S.

doi:10.1007/978-3-540-30555-2_16

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3347)

Abstract

Synchronous checkpointing is attractive because it simplifies failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronization messages and the number of checkpoints in this approach. Taking a checkpoint without blocking the underlying computation is another important feature of a checkpointing process. In this paper, we present a synchronous checkpointing algorithm that forces a minimum number of nodes to take a checkpoint. The underlying computation needs to be blocked only partially, and only in rare cases. The algorithm tolerates the failure of an arbitrary number of nodes while checkpointing is in progress. Consistency of the checkpoint is ensured during the checkpointing process, so no time needs to be spent on it during recovery.
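
The abstract relies on the notion of a consistent global checkpoint. As a minimal sketch of that standard notion only (not the protocol presented in the paper), the Python below treats a global checkpoint as a cut giving, for each process, the number of local events it covers, and calls it consistent when no message is recorded as received without its send also being recorded. The names `Message`, `orphan_messages`, and `is_consistent`, along with the process labels and event numbers, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    # Hypothetical record of one message; event positions are 1-based
    # indices into each process's local event order.
    sender: str
    send_event: int
    receiver: str
    recv_event: int


def orphan_messages(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Return messages whose receive lies inside the cut but whose send does not.

    cut[p] is the number of local events of process p covered by p's checkpoint.
    """
    return [
        m for m in messages
        if m.recv_event <= cut[m.receiver] and m.send_event > cut[m.sender]
    ]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A global checkpoint is consistent when it contains no orphan message."""
    return not orphan_messages(cut, messages)


# Illustrative history (assumed): P1 sends to P2, then P2 sends to P3.
messages = [
    Message("P1", 2, "P2", 1),
    Message("P2", 3, "P3", 2),
]

# Every recorded receive has its send recorded as well.
consistent_cut = {"P1": 2, "P2": 3, "P3": 2}

# P2 has recorded the receive of a message that P1 has not yet recorded sending.
inconsistent_cut = {"P1": 1, "P2": 1, "P3": 0}

print(is_consistent(consistent_cut, messages))
print(is_consistent(inconsistent_cut, messages))
```

A synchronous scheme coordinates the processes so that only cuts passing a check of this kind are ever stored, which is why recovery can restart from the stored checkpoint without further analysis.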

Author information

Authors and Affiliations

Indian Institute of Technology Guwahati, North Guwahati, 781039, India
D. Goswami & S. Sahu

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
R. K. Ghosh
Department of Computer and Information Science, University of Hyderabad, Central University PO, 500 046, AP, India
Hrushikesha Mohanty

About this paper

Cite this paper

Goswami, D., Sahu, S. (2004). An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems. In: Ghosh, R.K., Mohanty, H. (eds) Distributed Computing and Internet Technology. ICDCIT 2004. Lecture Notes in Computer Science, vol 3347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30555-2_16

DOI: https://doi.org/10.1007/978-3-540-30555-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24075-4
Online ISBN: 978-3-540-30555-2
eBook Packages: Computer Science, Computer Science (R0)
