Performance of consistent checkpointing in a modular operating system: Results of the FTM experiment

  • Session 11: Measurement
  • Conference paper
Dependable Computing — EDCC-1 (EDCC 1994)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)


Abstract

This paper evaluates the performance of a consistent checkpointing mechanism that has been integrated into a modular operating system based on the Mach microkernel. We measured the performance overhead of checkpointing for several typical workstation applications: number crunching and office tools. To do so, we added specific servers to a standard Mach 3.0/BSD system. Measurements were performed on failure-free executions, varying the number of checkpoints and thus the amount of computation lost in the event of a crash. Our initial results show a time overhead of about 3% when up to 20% of the work is lost in a crash, and an overhead between 16% and 23% when up to 1% of the computation is lost. In addition, when porting interactive office tools such as the micro-emacs text editor, we observed a maximal checkpoint duration of 1.4 seconds on our prototype machine, which is about as powerful as a Sun 3/60. Based on these results, we argue that checkpointing can be integrated into a modular microkernel-based operating system without degrading system performance.
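The two quantities traded off in the abstract, time overhead and computation lost, can be made concrete with a minimal sketch. It assumes the usual definitions: overhead is the relative slowdown of a failure-free run, and a run split into equal checkpoint intervals loses at most one interval of work in a crash. The paper itself may define or measure these differently. The function names and the timings are hypothetical and are not taken from the paper.

```python
def time_overhead(t_checkpointed: float, t_baseline: float) -> float:
    """Relative slowdown of a failure-free run caused by checkpointing."""
    if t_baseline <= 0:
        raise ValueError("baseline time must be positive")
    return (t_checkpointed - t_baseline) / t_baseline


def max_work_lost(n_intervals: int) -> float:
    """Worst-case fraction of the run lost to a crash.

    Assumption: the run is split into n_intervals equal intervals,
    each closed by a checkpoint, so a crash loses at most one interval.
    """
    if n_intervals < 1:
        raise ValueError("need at least one interval")
    return 1.0 / n_intervals


# Hypothetical timings in seconds, chosen only for illustration.
print(time_overhead(103.0, 100.0))  # relative overhead of the checkpointed run
print(max_work_lost(5))             # 5 intervals: at most 20% of the work lost
print(max_work_lost(100))           # 100 intervals: at most 1% of the work lost
```

Under these assumptions, moving from 20% to 1% of work lost means taking roughly twenty times as many checkpoints, which is consistent with the higher overhead reported for the 1% case.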




Editor information

Klaus Echtle, Dieter Hammer, David Powell


Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Muller, G., Hue, M., Peyrouze, N. (1994). Performance of consistent checkpointing in a modular operating system: Results of the FTM experiment. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_154


  • DOI: https://doi.org/10.1007/3-540-58426-9_154


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58426-1

  • Online ISBN: 978-3-540-48785-2

  • eBook Packages: Springer Book Archive
