Using Computing Checkpoints Implement Consistent Low-Cost Non-blocking Coordinated Checkpointing

  • Conference paper
Parallel and Distributed Computing: Applications and Technologies (PDCAT 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3320))

Abstract

Two approaches are used to reduce the overhead of coordinated checkpointing: one is to reduce the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process non-blocking. In this paper, we introduce the concept of a “computing checkpoint” to design an efficient, consistent, non-blocking coordinated checkpointing algorithm that combines these two approaches. By piggybacking on the broadcast commit message the information about which processes have taken new checkpoints, the checkpoint sequence number of every process is kept consistent at all processes, so that unnecessary checkpoints and orphan messages are avoided in subsequent execution. The algorithm does not block any process and has lower overhead than other proposed consistent coordinated checkpointing algorithms.
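The piggybacking idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which is not reproduced on this page; it only assumes that each process keeps a local vector of checkpoint sequence numbers and that the commit broadcast carries the set of processes that took a new checkpoint. The names `Process`, `on_commit`, `sender_is_ahead`, and `broadcast_commit` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process's local view of every process's checkpoint sequence number."""
    pid: int
    n: int                          # total number of processes (assumed fixed)
    csn: list = field(init=False)   # csn[j] = this process's view of process j

    def __post_init__(self):
        self.csn = [0] * self.n

    def on_commit(self, checkpointed):
        # The commit message piggybacks the ids of the processes that took a
        # new checkpoint, so every receiver advances the same entries.
        for j in checkpointed:
            self.csn[j] += 1

    def sender_is_ahead(self, sender, sender_csn):
        # Assumed rule: a message tagged with a sequence number newer than the
        # local view signals a checkpoint this process has not yet heard about.
        return sender_csn > self.csn[sender]


def broadcast_commit(processes, checkpointed):
    """Hypothetical stand-in for the broadcast: deliver the commit to everyone."""
    for p in processes:
        p.on_commit(checkpointed)
```

Under these assumptions, once a commit is delivered every process holds the same vector, so a later message tagged with the sender's current sequence number no longer looks like it comes from an unknown checkpoint interval.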

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Men, C., Yang, X. (2004). Using Computing Checkpoints Implement Consistent Low-Cost Non-blocking Coordinated Checkpointing. In: Liew, KM., Shen, H., See, S., Cai, W., Fan, P., Horiguchi, S. (eds) Parallel and Distributed Computing: Applications and Technologies. PDCAT 2004. Lecture Notes in Computer Science, vol 3320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30501-9_109

  • DOI: https://doi.org/10.1007/978-3-540-30501-9_109

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24013-6

  • Online ISBN: 978-3-540-30501-9

  • eBook Packages: Computer Science, Computer Science (R0)
