Deriving optimal checkpoint protocols for distributed shared memory architectures

Alvisi, Lorenzo; Marzullo, Keith

doi:10.1007/3-540-60042-6_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 938)

Abstract

Uncoordinated checkpointing is one technique used to build processes that can recover to a consistent state after crashing. This technique requires each process to periodically record its state in a checkpoint. Furthermore, the threads executing on each process log any non-deterministic action that they take following the latest checkpointed state. When a process crashes, a new process, initialized with the appropriate recorded local state, is created in its place. The new process restarts executing, and whenever one of its threads confronts a non-deterministic choice, the thread references the log in order to reproduce the same action performed before the crash. Thus, uncoordinated checkpointing implements an abstraction of a resilient process in which the crash of a process is translated into intermittent unavailability of that process.
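The checkpoint, log, and replay cycle described above can be illustrated with a small single-process sketch. It is not the protocol derived in the paper: the class `ResilientProcess`, its methods, and the function `run_demo` are hypothetical names chosen for illustration, the "non-deterministic action" is simply a random draw, and stable storage is assumed to be an in-memory copy that survives the crash.

```python
import copy
import random


class ResilientProcess:
    """Toy process whose state is updated by non-deterministic choices."""

    def __init__(self, rng):
        self.rng = rng
        self.state = {"total": 0, "steps": 0}
        # Assumption: the checkpoint and the log live on stable storage.
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []      # choices made since the latest checkpoint
        self.replay = []   # logged choices still to be reproduced

    def choose(self):
        """Resolve one non-deterministic choice, consulting the log first."""
        if self.replay:
            value = self.replay.pop(0)        # reproduce the pre-crash action
        else:
            value = self.rng.randint(1, 100)  # genuinely new choice
        self.log.append(value)
        return value

    def step(self):
        self.state["total"] += self.choose()
        self.state["steps"] += 1

    def take_checkpoint(self):
        """Record the local state; older log entries are no longer needed."""
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self, rng):
        """Create a replacement process from the checkpoint and the log."""
        fresh = ResilientProcess(rng)
        fresh.state = copy.deepcopy(self.checkpoint)
        fresh.checkpoint = copy.deepcopy(self.checkpoint)
        fresh.replay = list(self.log)
        return fresh


def run_demo():
    original = ResilientProcess(random.Random(1))
    for _ in range(3):
        original.step()
    original.take_checkpoint()
    for _ in range(4):
        original.step()
    pre_crash = copy.deepcopy(original.state)

    # Crash: the volatile state is lost, the checkpoint and log survive.
    # The replacement uses a different seed on purpose, so only the log
    # can make it repeat the earlier choices.
    recovered = original.recover(random.Random(999))
    for _ in range(4):
        recovered.step()
    return pre_crash, recovered.state
```

In this sketch the recovered process reaches the pre-crash state because every choice made after the checkpoint is read back from the log rather than drawn again, which is the sense in which a crash becomes temporary unavailability.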

We give a specification of the consistency property “no orphan threads” in the context of multithreaded processes running on a shared memory multiprocessor. We also give a definition of optimality for uncoordinated checkpointing protocols given a memory coherency protocol. We then use this specification to derive an existing uncoordinated checkpoint protocol and show that it is optimal. This protocol assumes that once a process crashes, no further processes crash until the first process completes recovery.

This author was supported in part by the Office of Naval Research under contract N00014-91-J-1219, the National Science Foundation under Grant No. CCR-9003440, DARPA/NSF Grant No. CCR-9014363, NASA/DARPA grant NAG-2-893, and AFOSR grant F49620-94-1-0198. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.

This author was supported in part by the Defense Advanced Research Projects Agency (DoD) under NASA Ames grant number NAG 2-593, Contract N00140-87-C-8904 and by AFOSR grant number F49620-93-1-0242. The views, opinions, and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.

Author information

Authors and Affiliations

Department of Computer Science, Cornell University, Ithaca, NY
Lorenzo Alvisi
Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA
Keith Marzullo

Editor information

Kenneth P. Birman, Friedemann Mattern, André Schiper

About this paper

Cite this paper

Alvisi, L., Marzullo, K. (1995). Deriving optimal checkpoint protocols for distributed shared memory architectures. In: Birman, K.P., Mattern, F., Schiper, A. (eds) Theory and Practice in Distributed Systems. Lecture Notes in Computer Science, vol 938. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60042-6_8

DOI: https://doi.org/10.1007/3-540-60042-6_8
Published: 05 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60042-8
Online ISBN: 978-3-540-49409-6
eBook Packages: Springer Book Archive
