Abstract
Two approaches are used to reduce the overhead of coordinated checkpointing: one reduces the number of synchronization messages and checkpoints; the other makes the checkpointing process non-blocking. In this paper, we introduce the concept of a “computing checkpoint” to design an efficient, consistent, non-blocking coordinated checkpointing algorithm that combines these two approaches. By piggybacking information about which processes have taken new checkpoints on the broadcast commit message, the algorithm keeps each process's checkpoint sequence number consistent across all processes, so that unnecessary checkpoints and orphan messages are avoided in subsequent execution. The algorithm does not block any process and has lower overhead than other proposed consistent coordinated checkpointing algorithms.
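The piggybacking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it only shows how a commit broadcast that lists the processes which took a new checkpoint lets every process update its local view of all checkpoint sequence numbers in the same way. The names `CommitMessage`, `Process`, `csn`, and `broadcast_commit` are hypothetical, and message delivery is modeled as a direct, reliable function call.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class CommitMessage:
    """Hypothetical commit broadcast sent by the checkpoint initiator."""
    initiator: int
    # Piggybacked information: ids of processes that took a new checkpoint.
    checkpointed: FrozenSet[int]


@dataclass
class Process:
    pid: int
    n: int  # total number of processes in the system
    # csn[j] is this process's view of process j's checkpoint sequence number.
    csn: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.csn:
            self.csn = [0] * self.n

    def on_commit(self, msg: CommitMessage) -> None:
        # Advance the sequence number of every process named in the message,
        # whether or not this process took a checkpoint itself.
        for j in msg.checkpointed:
            self.csn[j] += 1


def broadcast_commit(processes: List[Process], initiator: int,
                     checkpointed: List[int]) -> CommitMessage:
    """Deliver one commit message to every process (assumed reliable)."""
    msg = CommitMessage(initiator, frozenset(checkpointed))
    for p in processes:
        p.on_commit(msg)
    return msg


if __name__ == "__main__":
    procs = [Process(pid=i, n=4) for i in range(4)]
    # Assume processes 0 and 2 took new checkpoints in this round.
    broadcast_commit(procs, initiator=0, checkpointed=[0, 2])
    for p in procs:
        print(p.pid, p.csn)
```

Because every process applies the same update from the same message, all local `csn` vectors stay identical after each commit round, which is the consistency property the abstract relies on to avoid unnecessary checkpoints later.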
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Men, C., Yang, X. (2004). Using Computing Checkpoints Implement Consistent Low-Cost Non-blocking Coordinated Checkpointing. In: Liew, KM., Shen, H., See, S., Cai, W., Fan, P., Horiguchi, S. (eds) Parallel and Distributed Computing: Applications and Technologies. PDCAT 2004. Lecture Notes in Computer Science, vol 3320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30501-9_109
DOI: https://doi.org/10.1007/978-3-540-30501-9_109
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24013-6
Online ISBN: 978-3-540-30501-9
eBook Packages: Computer Science, Computer Science (R0)