Simulative performance analysis of gossip failure detection for scalable distributed systems

Burns, Mark W.; George, Alan D.; Wallace, Bradley A.

doi:10.1023/A:1019086910915

Cluster Computing, Volume 2, pages 207–217 (1999)

Published: November 1999

Abstract

Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems composed of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
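To make the idea of a basic gossip failure detector concrete, the following is a minimal sketch of a generic heartbeat-gossip scheme with a single fail-silent node. It is not the simulation model or the agreement protocol evaluated in the article. The class and function names, the round-based clock, the static membership, and every parameter value (`t_fail`, `crash_at`, the group size) are assumptions chosen only for illustration.

```python
import random


class GossipNode:
    """One member of a gossip group (illustrative; all names are hypothetical)."""

    def __init__(self, node_id, members, t_fail):
        self.node_id = node_id
        self.t_fail = t_fail  # rounds without progress before a peer is suspected
        # Highest heartbeat seen per member, and the local round it was seen in.
        # Membership is assumed static, so every member has an entry.
        self.heartbeat = {m: 0 for m in members}
        self.last_update = {m: 0 for m in members}
        self.alive = True

    def tick(self, now):
        """Advance our own heartbeat and return a copy of the list to gossip."""
        self.heartbeat[self.node_id] += 1
        self.last_update[self.node_id] = now
        return dict(self.heartbeat)

    def merge(self, remote, now):
        """Keep the larger heartbeat for every member in a received list."""
        for member, beat in remote.items():
            if beat > self.heartbeat[member]:
                self.heartbeat[member] = beat
                self.last_update[member] = now

    def suspects(self, now):
        """Members whose heartbeat has not advanced for more than t_fail rounds."""
        return {m for m, seen in self.last_update.items()
                if m != self.node_id and now - seen > self.t_fail}


def simulate(n=8, rounds=60, crash_at=10, crashed=3, t_fail=20, seed=1):
    """Run a round-based gossip group and return each live node's suspect set."""
    rng = random.Random(seed)
    ids = list(range(n))
    nodes = {i: GossipNode(i, ids, t_fail) for i in ids}
    for now in range(1, rounds + 1):
        if now == crash_at:
            nodes[crashed].alive = False  # fail-silent: the node just stops sending
        for node in nodes.values():
            if not node.alive:
                continue
            message = node.tick(now)
            # Gossip to one randomly chosen other member, dead or alive.
            target = rng.choice([i for i in ids if i != node.node_id])
            if nodes[target].alive:
                nodes[target].merge(message, now)
    return {i: node.suspects(rounds) for i, node in nodes.items() if node.alive}


if __name__ == "__main__":
    for node_id, suspected in simulate().items():
        print(node_id, sorted(suspected))
```

In this sketch a node is suspected purely on local evidence: no message ever announces a failure, and the crashed node's entry simply stops advancing in every list. Reaching agreement on a failure across the group, as well as the hierarchical and piggybacked variants compared in the article, would need additional state that is not modeled here.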

Author information

Authors and Affiliations

High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida, P.O. Box 116200, Gainesville, FL, 32611-6200, USA
Mark W. Burns, Alan D. George & Bradley A. Wallace

About this article

Cite this article

Burns, M.W., George, A.D. & Wallace, B.A. Simulative performance analysis of gossip failure detection for scalable distributed systems. Cluster Computing 2, 207–217 (1999). https://doi.org/10.1023/A:1019086910915
