Implementing Rollback-Recovery Coordinated Checkpoints

  • Conference paper
Advanced Distributed Systems (ISSADS 2005)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3563)

Abstract

Recovering from processor failures is an important problem in the design of reliable distributed systems. Processes must coordinate their operation to guarantee that the local checkpoints taken by the individual processes form a consistent global checkpoint (a recovery line), which allows the system to resume operation from a consistent global state when recovering from a failure. This paper presents the results of implementing a distributed rollback-recovery algorithm that is transparent (it places no special requirements on applications) and coordinated yet non-blocking. Because it does not block applications, overhead during failure-free operation is reduced. Furthermore, the rollback procedure can be executed quickly because a recovery line is always available and well identified. Our preliminary experimental results show very low performance overhead (less than 2%) and a strong dependence on checkpoint size. We are now studying implementation optimizations to reduce checkpoint latency.
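
To make the notion of a consistent global checkpoint concrete, the following is a minimal sketch of the standard consistency condition: a set of local checkpoints is consistent if no message is recorded as received before the receiver's checkpoint while being sent after the sender's checkpoint (an orphan message). This is not the algorithm implemented in the paper. The `Message` type, the use of per-process event counters, and the process names are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message, using per-process event counters."""
    sender: str
    send_time: int   # sender's local event counter at the send
    receiver: str
    recv_time: int   # receiver's local event counter at the receive


def is_consistent(checkpoints: dict[str, int], messages: list[Message]) -> bool:
    """Return True if the local checkpoints form a consistent global checkpoint.

    `checkpoints` maps each process to the local event counter at which its
    checkpoint was taken. The cut is inconsistent if any message is an orphan:
    received at or before the receiver's checkpoint, but sent after the
    sender's checkpoint.
    """
    for m in messages:
        received_before_cut = m.recv_time <= checkpoints[m.receiver]
        sent_after_cut = m.send_time > checkpoints[m.sender]
        if received_before_cut and sent_after_cut:
            return False
    return True


# P1 sends at its local event 3; P2 receives at its local event 2.
messages = [Message("P1", 3, "P2", 2)]

# P1 checkpoints after the send, so the message is covered on both sides.
consistent_cut = {"P1": 4, "P2": 5}

# P1 checkpoints before the send, but P2's checkpoint already contains the
# receive: rolling back to this cut would leave an orphan message.
inconsistent_cut = {"P1": 2, "P2": 5}

print(is_consistent(consistent_cut, messages))
print(is_consistent(inconsistent_cut, messages))
```

A coordinated protocol aims to ensure that every set of checkpoints it commits satisfies this condition by construction, which is why a recovery line is always available at rollback time.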




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Buligon, C., Cechin, S., Jansch-Pôrto, I. (2005). Implementing Rollback-Recovery Coordinated Checkpoints. In: Ramos, F.F., Larios Rosillo, V., Unger, H. (eds) Advanced Distributed Systems. ISSADS 2005. Lecture Notes in Computer Science, vol 3563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11533962_22

  • DOI: https://doi.org/10.1007/11533962_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28063-7

  • Online ISBN: 978-3-540-31674-9

  • eBook Packages: Computer Science, Computer Science (R0)
