An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3347)

Abstract

Synchronous checkpointing is an attractive approach because it simplifies failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronization messages and the number of checkpoints taken in such an approach. Taking a checkpoint without blocking the underlying computation is another important property of the checkpointing process. In this paper, we present a synchronous checkpointing algorithm that forces a minimum number of nodes to take a checkpoint. The underlying computation needs to be blocked only partially, and only in rare cases. The algorithm tolerates the failure of an arbitrary number of nodes while checkpointing is in progress. Consistency of the checkpoint is ensured during the checkpointing process, so no time needs to be spent establishing it during recovery.
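To make the idea of forcing only a minimum number of nodes to checkpoint concrete, the following is a minimal sketch of the general dependency-tracking approach used in min-process coordinated checkpointing. It is not the algorithm of this paper, whose details are not given in the abstract. The function name `must_checkpoint`, the `received_from` map, and the process names are all hypothetical. The assumption is that each process records the set of processes it has received messages from since its own last checkpoint; when an initiator checkpoints, every sender it transitively depends on must also checkpoint so that no received message is left without a recorded send.

```python
from collections import deque


def must_checkpoint(initiator, received_from):
    """Return the set of processes forced to checkpoint (illustrative only).

    received_from[p] is the set of processes from which p has received at
    least one message since p's last checkpoint (a hypothetical input).
    """
    forced = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        # Every sender p depends on must record its sends as well.
        for q in received_from.get(p, set()):
            if q not in forced:
                forced.add(q)
                queue.append(q)
    return forced


# Hypothetical dependencies: P1 heard from P2, P2 from P3, P4 from P1.
deps = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": {"P1"}}
print(sorted(must_checkpoint("P1", deps)))
```

In this example P4 is not forced to checkpoint when P1 initiates: the message from P1 to P4 has its send recorded in P1's new checkpoint and its receipt recorded nowhere, so it is merely in transit and does not make the global checkpoint inconsistent.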

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Goswami, D., Sahu, S. (2004). An Efficient Protocol for Checkpoint-Based Failure Recovery in Distributed Systems. In: Ghosh, R.K., Mohanty, H. (eds) Distributed Computing and Internet Technology. ICDCIT 2004. Lecture Notes in Computer Science, vol 3347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30555-2_16

  • DOI: https://doi.org/10.1007/978-3-540-30555-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24075-4

  • Online ISBN: 978-3-540-30555-2

  • eBook Packages: Computer Science, Computer Science (R0)
