Fault tolerance for clusters of workstations

Hardware and Software Architectures for Fault Tolerance (Fault Tolerance 1993)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 774))

Abstract

This paper presents a short description of the Manetho system, which provides fault tolerance for parallel application programs that execute on a cluster of workstations. Manetho uses a combination of rollback-recovery and process replication. Both methods are application-transparent, making it possible to automatically provide fault tolerance for existing applications.

Editor information

Michel Banâtre, Peter A. Lee

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Elnozahy, E.N. (1994). Fault tolerance for clusters of workstations. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020030

  • DOI: https://doi.org/10.1007/BFb0020030

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57767-6

  • Online ISBN: 978-3-540-48330-4

  • eBook Packages: Springer Book Archive
