Abstract
Synchronous checkpointing is attractive because it stores a consistent global checkpoint, which simplifies failure recovery. Prior efforts have sought to minimize the number of synchronization messages and the number of checkpoints that such an approach requires. Taking a checkpoint without blocking the underlying computation is another important property of a checkpointing process. In this paper, we present a synchronous checkpointing algorithm that forces only a minimum number of nodes to take a checkpoint. The underlying computation is blocked only partially, and only in rare cases. The algorithm tolerates the failure of an arbitrary number of nodes while checkpointing is in progress. Because consistency of the checkpoint is ensured during the checkpointing process, no time needs to be spent establishing consistency during recovery.
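The abstract does not spell out the protocol, so the following is only a minimal sketch of the two general ideas it relies on, not the authors' algorithm. It assumes the textbook dependency rule for coordinated checkpointing: if a process has received a message from a sender since its last checkpoint, the sender must checkpoint too, or that message becomes an orphan. The names `min_checkpoint_set`, `has_orphan_message`, `received_from`, and `checkpoint_time` are hypothetical and chosen for illustration.

```python
from collections import deque


def min_checkpoint_set(initiator, received_from):
    """Return the processes that must checkpoint together with `initiator`.

    Assumption (not from the paper): received_from[p] is the set of
    processes from which p has received a message since p's last checkpoint.
    The result is the transitive closure of that relation from the initiator.
    """
    must_checkpoint = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for sender in received_from.get(p, ()):
            if sender not in must_checkpoint:
                must_checkpoint.add(sender)
                queue.append(sender)
    return must_checkpoint


def has_orphan_message(checkpoint_time, messages):
    """Return True if the global checkpoint is inconsistent.

    checkpoint_time[p] is the local event counter at which p checkpointed.
    Each message is (sender, send_time, receiver, receive_time), with times
    measured on the local counter of the sender and receiver respectively.
    A message is an orphan if its receipt is recorded in the receiver's
    checkpoint but its send is not recorded in the sender's checkpoint.
    """
    return any(
        send_t > checkpoint_time[src] and recv_t <= checkpoint_time[dst]
        for src, send_t, dst, recv_t in messages
    )


# Hypothetical four-process example: P0 depends on P1, P1 on P2.
# P3 received from P0, but nothing forces P3 to checkpoint.
deps = {"P0": {"P1"}, "P1": {"P2"}, "P2": set(), "P3": {"P0"}}
forced = min_checkpoint_set("P0", deps)

# P1 sends at local time 4 and P0 receives at local time 5.
msgs = [("P1", 4, "P0", 5)]
inconsistent = has_orphan_message({"P0": 5, "P1": 3}, msgs)
consistent = not has_orphan_message({"P0": 5, "P1": 4}, msgs)
```

In this example only P0, P1, and P2 are forced to checkpoint, which illustrates what a minimum set of nodes can mean under the assumed dependency rule. The paper's own criterion and message exchange may differ.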
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Goswami, D., Sahu, S. (2004). An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems. In: Ghosh, R.K., Mohanty, H. (eds) Distributed Computing and Internet Technology. ICDCIT 2004. Lecture Notes in Computer Science, vol 3347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30555-2_16
DOI: https://doi.org/10.1007/978-3-540-30555-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24075-4
Online ISBN: 978-3-540-30555-2
eBook Packages: Computer Science, Computer Science (R0)