Abstract
This paper presents a short description of the Manetho system, which provides fault tolerance for parallel application programs that execute on a cluster of workstations. Manetho uses a combination of rollback-recovery and process replication. Both methods are application-transparent, making it possible to automatically provide fault tolerance for existing applications.
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Elnozahy, E.N. (1994). Fault tolerance for clusters of workstations. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020030
DOI: https://doi.org/10.1007/BFb0020030
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive