Implementing Rollback-Recovery Coordinated Checkpoints

Buligon, Clairton; Cechin, Sérgio; Jansch-Pôrto, Ingrid

doi:10.1007/11533962_22

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3563)

Included in the following conference series:

International Symposium and School on Advanced Distributed Systems

Abstract

Recovering from processor failures is an important problem in the design of reliable distributed systems. Processes must coordinate so that the local checkpoints taken by the individual processes together form a consistent global checkpoint (a recovery line), which lets the system resume operation from a consistent global state after a failure. This paper presents the results of implementing a distributed rollback-recovery algorithm that is transparent (it places no special requirements on applications) and coordinated (non-blocking). Because the algorithm does not block applications, overhead during failure-free operation is reduced. Furthermore, rollback can be executed quickly, since a recovery line is always available and well identified. Our preliminary experimental results show that the algorithm causes very low performance overhead (less than 2%) and that this overhead depends strongly on the checkpoint size. We are now studying implementation optimizations to reduce checkpoint latency.
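
To make the notion of a consistent global checkpoint concrete, the following minimal sketch checks the standard consistency condition: no message may be recorded as received in one checkpoint unless it is also recorded as sent in the sender's checkpoint. It is not the authors' algorithm or implementation. The names `Message`, `orphan_messages`, and `is_recovery_line`, and the use of per-process event counters as timestamps, are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # sender's local event counter at the send event
    receiver: str
    recv_time: int   # receiver's local event counter at the receive event


def orphan_messages(checkpoints, messages):
    """Return messages that are received inside the checkpointed state
    but sent after the sender's checkpoint (orphan messages).

    `checkpoints` maps a process name to the local event counter at which
    its checkpoint was taken; events with a counter <= that value are
    considered part of the checkpoint (an assumption of this sketch).
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def is_recovery_line(checkpoints, messages):
    """A set of local checkpoints is consistent if it has no orphans."""
    return not orphan_messages(checkpoints, messages)


# Hypothetical two-process history: P0 sends to P1, later P1 sends to P0.
msgs = [
    Message("P0", 2, "P1", 3),
    Message("P1", 5, "P0", 6),
]

consistent = {"P0": 4, "P1": 4}    # both checkpoints include the first message
inconsistent = {"P0": 1, "P1": 4}  # P1 recorded a receive that P0 never sent

print(is_recovery_line(consistent, msgs))
print(is_recovery_line(inconsistent, msgs))
```

In the second case, rolling back to those checkpoints would leave P1 holding a message that P0, after recovery, has no record of sending. Coordinated checkpointing is designed to exclude such cuts.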

Author information

Authors and Affiliations

Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), P.O.Box 15064, Porto Alegre, RS, Brazil
Clairton Buligon, Sérgio Cechin & Ingrid Jansch-Pôrto

Editor information

Editors and Affiliations

Multi-Agent Systems Development Group, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Prolongación López Mateos Sur No. 590, Guadalajara, Jalisco, México
Félix F. Ramos
Department "Sistemas de Informacion", Universidad de Guadalajara, CUCEA, 799, Periferico Norte, Ed. L308, 45100, Zapopan, Jal., Mexico
Victor Larios Rosillo
Computer Science Dept., University of Rostock, 18051, Rostock, Germany
Herwig Unger

Cite this paper

Buligon, C., Cechin, S., Jansch-Pôrto, I. (2005). Implementing Rollback-Recovery Coordinated Checkpoints. In: Ramos, F.F., Larios Rosillo, V., Unger, H. (eds) Advanced Distributed Systems. ISSADS 2005. Lecture Notes in Computer Science, vol 3563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11533962_22

DOI: https://doi.org/10.1007/11533962_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28063-7
Online ISBN: 978-3-540-31674-9
eBook Packages: Computer Science, Computer Science (R0)
