Abstract
This paper evaluates the performance of a consistent checkpointing mechanism that has been integrated into a modular operating system based on the Mach microkernel. We measured the performance overhead of checkpointing for several typical workstation applications: number crunching and office tools. The measurements used specific servers added to a standard Mach 3.0/BSD system. They were performed for failure-free executions, varying the number of checkpoints and thus the amount of computation lost in the event of a crash. Our initial results show a time overhead of about 3% when up to 20% of the work is lost in the event of a crash, and an overhead between 16% and 23% when up to 1% of the computation is lost. In addition, for ported interactive office tools such as the micro-emacs text editor, we measured a maximum checkpoint duration of 1.4 seconds on our prototype machine, which is as powerful as a Sun 3/60. Based on these results, we argue that checkpointing can be integrated into a modular microkernel-based operating system without degrading system performance.
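The trade-off described in the abstract, between the number of checkpoints and the amount of work lost, can be illustrated with a short sketch. It is not the paper's measurement method: it assumes checkpoints are evenly spaced over a run, so that a crash loses at most one interval of work, and the function names `intervals_for_max_loss` and `time_overhead` are hypothetical.

```python
import math


def intervals_for_max_loss(max_loss_fraction):
    """Number of equal checkpoint intervals needed so that a crash loses
    at most `max_loss_fraction` of the run (assumption: evenly spaced
    checkpoints, crash loses at most one interval)."""
    if not 0.0 < max_loss_fraction <= 1.0:
        raise ValueError("max_loss_fraction must be in (0, 1]")
    # Small tolerance guards against floating-point noise in 1 / fraction.
    return math.ceil(1.0 / max_loss_fraction - 1e-9)


def time_overhead(time_with_checkpoints, time_without_checkpoints):
    """Relative time overhead of a failure-free run with checkpointing,
    compared with the same run without checkpointing."""
    if time_without_checkpoints <= 0:
        raise ValueError("baseline time must be positive")
    return (time_with_checkpoints - time_without_checkpoints) / time_without_checkpoints


if __name__ == "__main__":
    # The two loss bounds quoted in the abstract.
    for loss in (0.20, 0.01):
        print(f"max loss {loss:.0%}: {intervals_for_max_loss(loss)} intervals")
    # Hypothetical timings, chosen only to show the overhead calculation.
    print(f"overhead: {time_overhead(103.0, 100.0):.1%}")
```

Under this assumption, bounding the lost work at 20% requires 5 intervals and bounding it at 1% requires 100, which is why the tighter bound comes with a higher overhead.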
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Muller, G., Hue, M., Peyrouze, N. (1994). Performance of consistent checkpointing in a modular operating system: Results of the FTM experiment. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_154
DOI: https://doi.org/10.1007/3-540-58426-9_154
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58426-1
Online ISBN: 978-3-540-48785-2
eBook Packages: Springer Book Archive