Abstract
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol allows failures to be detected asynchronously in large distributed systems, without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems composed of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
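To make the basic gossip idea concrete, the following is a minimal sketch of a flat, heartbeat-based gossip failure detector with one fail-silent node. It is an illustration only and not the simulation model used in the article: the names GossipNode, t_fail, and simulate, the round-based clock, and all parameter values are assumptions chosen for the example. The hierarchical and piggybacking variants are not shown.

```python
import random


class GossipNode:
    """One member of a flat gossip group (hypothetical illustration)."""

    def __init__(self, node_id, all_ids, t_fail):
        self.node_id = node_id
        self.t_fail = t_fail  # assumed timeout, in rounds, before suspecting a node
        # Highest heartbeat counter seen so far for every node.
        self.heartbeat = {j: 0 for j in all_ids}
        # Local round at which each counter last increased.
        self.last_update = {j: 0 for j in all_ids}

    def tick(self, now):
        """Advance this node's own heartbeat before it gossips."""
        self.heartbeat[self.node_id] += 1
        self.last_update[self.node_id] = now

    def merge(self, remote_heartbeat, now):
        """Adopt any counter in a received gossip that is newer than ours."""
        for j, hb in remote_heartbeat.items():
            if hb > self.heartbeat[j]:
                self.heartbeat[j] = hb
                self.last_update[j] = now

    def suspects(self, now):
        """Nodes whose counter has not increased for more than t_fail rounds."""
        return {
            j
            for j in self.heartbeat
            if j != self.node_id and now - self.last_update[j] > self.t_fail
        }


def simulate(n_nodes=8, rounds=60, crash_id=3, crash_round=10, t_fail=12, seed=0):
    """Run the gossip rounds with one node that goes fail-silent."""
    rng = random.Random(seed)
    ids = list(range(n_nodes))
    nodes = {i: GossipNode(i, ids, t_fail) for i in ids}
    for now in range(1, rounds + 1):
        for i in ids:
            if i == crash_id and now >= crash_round:
                continue  # fail-silent: a crashed node sends nothing
            nodes[i].tick(now)
            target = rng.choice([j for j in ids if j != i])
            if target == crash_id and now >= crash_round:
                continue  # gossip sent to a crashed node is simply lost
            nodes[target].merge(nodes[i].heartbeat, now)
    # Suspicion sets held by the surviving nodes at the end of the run.
    return {i: nodes[i].suspects(rounds) for i in ids if i != crash_id}


if __name__ == "__main__":
    for node_id, suspected in simulate().items():
        print(node_id, sorted(suspected))
```

In this sketch each live node gossips its full heartbeat table to one randomly chosen peer per round, so no reliable multicast is needed. A shorter t_fail detects the crash sooner but makes it more likely that a live node is suspected by mistake.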
Cite this article
Burns, M.W., George, A.D. & Wallace, B.A. Simulative performance analysis of gossip failure detection for scalable distributed systems. Cluster Computing 2, 207–217 (1999). https://doi.org/10.1023/A:1019086910915