
Simulative performance analysis of gossip failure detection for scalable distributed systems

Cluster Computing

Abstract

Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol detects failures in large distributed systems asynchronously, without the limitations associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems composed of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
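To make the basic idea concrete, the following is a minimal sketch of a flat, heartbeat-style gossip failure detector with a single fail-silent node crash. The abstract does not specify message formats or parameters, so everything here is an assumption for illustration: the `GossipNode` class, the per-node heartbeat counters, the `t_fail` timeout, the round-based clock, and the `simulate` helper are hypothetical names, not the authors' model. The hierarchical and piggybacking variants, link failures, and transient failures are not modeled.

```python
import random


class GossipNode:
    """One participant in a flat (non-hierarchical) gossip group. Hypothetical sketch."""

    def __init__(self, node_id, all_ids, t_fail):
        self.node_id = node_id
        self.t_fail = t_fail  # assumed timeout, in rounds, before a node is suspected
        # Highest heartbeat counter known for each node.
        self.heartbeat = {n: 0 for n in all_ids}
        # Local round in which that counter was last seen to increase.
        self.last_update = {n: 0 for n in all_ids}

    def tick(self, now):
        """Advance this node's own heartbeat for the current round."""
        self.heartbeat[self.node_id] += 1
        self.last_update[self.node_id] = now

    def merge(self, remote_heartbeat, now):
        """Merge a received gossip message, keeping the larger counter per node."""
        for n, hb in remote_heartbeat.items():
            if hb > self.heartbeat[n]:
                self.heartbeat[n] = hb
                self.last_update[n] = now

    def suspects(self, now):
        """Nodes whose heartbeat has not increased for more than t_fail rounds."""
        return {n for n, t in self.last_update.items() if now - t > self.t_fail}


def simulate(num_nodes=8, rounds=40, crash_node=3, crash_round=10, t_fail=12, seed=1):
    """Run a round-based simulation with one permanent, fail-silent node crash."""
    rng = random.Random(seed)
    ids = list(range(num_nodes))
    nodes = {i: GossipNode(i, ids, t_fail) for i in ids}
    alive = set(ids)
    for now in range(1, rounds + 1):
        if now == crash_round:
            alive.discard(crash_node)  # fail-silent: the node simply stops sending
        for i in sorted(alive):
            nodes[i].tick(now)
            # Each live node gossips its table to one randomly chosen other node.
            target = rng.choice([j for j in ids if j != i])
            if target in alive:
                nodes[target].merge(nodes[i].heartbeat, now)
    # Suspect sets as seen by each surviving node at the end of the run.
    return {i: nodes[i].suspects(rounds) for i in sorted(alive)}


if __name__ == "__main__":
    for node_id, suspected in simulate().items():
        print(node_id, sorted(suspected))
```

In this sketch, agreement would correspond to every surviving node listing the crashed node in its suspect set. The number of rounds needed to reach that state and the number of gossip messages sent are the kinds of quantities the abstract names as evaluation metrics, although the sketch makes no attempt to reproduce the paper's measurements.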




Cite this article

Burns, M.W., George, A.D. & Wallace, B.A. Simulative performance analysis of gossip failure detection for scalable distributed systems. Cluster Computing 2, 207–217 (1999). https://doi.org/10.1023/A:1019086910915
