Performance of consistent checkpointing in a modular operating system: Results of the FTM experiment

Muller, Gilles; Hue, Mireille; Peyrouze, Nadine

doi:10.1007/3-540-58426-9_154

Gilles Muller¹,
Mireille Hue² &
Nadine Peyrouze²

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Included in the following conference series:

European Dependable Computing Conference

Abstract

This paper evaluates the performance of a consistent checkpointing mechanism integrated into a modular operating system based on the Mach microkernel. We measured the overhead of checkpointing for several typical workstation applications: number crunching and office tools. To do so, we added specific servers to a standard Mach 3.0/BSD system. Measurements were taken during failure-free executions, varying the number of checkpoints and thus the amount of computation lost in the event of a crash. Our initial results show a time overhead of about 3% when up to 20% of the work is lost in the event of a crash, and an overhead between 16% and 23% when up to 1% of the computation is lost. In addition, when porting interactive office tools such as the micro-emacs text editor, we obtain a maximum checkpoint duration of 1.4 seconds on our prototype machine, which is as powerful as a Sun 3/60. Based on these results, we argue that checkpointing can be integrated into a modular microkernel-based operating system without degrading system performance.
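
The trade-off described in the abstract, between the number of checkpoints and the bound on lost work, can be illustrated with a small sketch. It assumes checkpoints are spaced evenly over the run, which the abstract does not state. The function names and the timings in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def intervals_for_max_loss(max_loss_percent: float) -> int:
    """Number of equal checkpoint intervals needed so that a crash loses
    at most max_loss_percent of the total computation.

    Assumption (not from the paper): checkpoints are evenly spaced, so a
    crash loses at most one interval of work.
    """
    if not 0 < max_loss_percent <= 100:
        raise ValueError("max_loss_percent must be in (0, 100]")
    return math.ceil(100 / max_loss_percent)


def time_overhead(base_seconds: float, checkpointed_seconds: float) -> float:
    """Relative slowdown of a failure-free run with checkpointing enabled,
    compared with the same run without checkpointing."""
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    return (checkpointed_seconds - base_seconds) / base_seconds


# Bounds quoted in the abstract: 20% and 1% of the work lost.
print(intervals_for_max_loss(20))  # coarse checkpointing
print(intervals_for_max_loss(1))   # fine-grained checkpointing

# Hypothetical timings, chosen only to show how an overhead is computed.
print(f"{time_overhead(100.0, 103.0):.1%}")
```

Under this even-spacing assumption, tightening the loss bound from 20% to 1% multiplies the number of intervals by twenty, which is consistent with the direction of the overhead increase reported in the abstract.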

Author information

Authors and Affiliations

IRISA/INRIA, Campus de Beaulieu, 35042, Rennes Cedex, France
Gilles Muller
BULL Research IRISA, Campus de Beaulieu, 35042, Rennes Cedex, France
Mireille Hue & Nadine Peyrouze

Editor information

Klaus Echtle, Dieter Hammer, David Powell

Cite this paper

Muller, G., Hue, M., Peyrouze, N. (1994). Performance of consistent checkpointing in a modular operating system: Results of the FTM experiment. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_154

DOI: https://doi.org/10.1007/3-540-58426-9_154
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive
