Fault tolerance for clusters of workstations

Elnozahy, Elmootazbellah N.

doi:10.1007/BFb0020030

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 774)

Included in the following conference series:

Workshop on Fault Tolerance

Abstract

This paper presents a short description of the Manetho system, which provides fault tolerance for parallel application programs that execute on a cluster of workstations. Manetho uses a combination of rollback-recovery and process replication. Both methods are application-transparent, making it possible to automatically provide fault tolerance for existing applications.
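The rollback-recovery idea named in the abstract can be illustrated with a toy sketch. This is a generic checkpoint-plus-message-log example, not Manetho's actual protocol. The class `RecoverableProcess` and all of its methods are hypothetical names introduced here. It assumes that message processing is deterministic and that the checkpoint and the log survive a crash, as they would on stable storage.

```python
import copy


class RecoverableProcess:
    """Toy process whose state is rebuilt from a checkpoint plus a message log.

    Hypothetical illustration only: the checkpoint and the log are plain
    Python objects standing in for stable storage.
    """

    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []  # messages delivered since the last checkpoint

    def deliver(self, msg):
        # Log the message before processing it so it can be replayed later.
        self.log.append(msg)
        self._apply(msg)

    def _apply(self, msg):
        # Deterministic state transition: the same messages in the same
        # order always produce the same state.
        self.state["total"] += msg

    def take_checkpoint(self):
        # Save the current state; earlier messages no longer need replaying.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate the loss of volatile state.
        self.state = None

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            self._apply(msg)


p = RecoverableProcess()
p.deliver(5)
p.take_checkpoint()
p.deliver(7)
p.deliver(3)
p.crash()
p.recover()
assert p.state["total"] == 15
```

Because recovery happens inside the process wrapper rather than in the application logic (`_apply` here), the application code does not need to know about failures, which is the sense in which such schemes can be application-transparent.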

Author information

Authors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Elmootazbellah N. Elnozahy

Editor information

Michel Banâtre, Peter A. Lee

Cite this paper

Elnozahy, E.N. (1994). Fault tolerance for clusters of workstations. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020030

DOI: https://doi.org/10.1007/BFb0020030
Published: 10 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive
