Self-adaptive Failure Detector for Peer-to-Peer Distributed System Considering the Link Faults

He, Yanzhang; Jiang, Xiaohong; Dai, Changbo; Fan, Zikun

doi:10.1007/978-3-319-67952-5_6


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10561)

Included in the following conference series:

International Workshop on Advanced Parallel Processing Technologies


Abstract

Distributed computing now prevails in artificial intelligence applications because the computational capacity of a single node is limited. A distributed computing system generally contains a large number of computing nodes, so failures are commonplace. Failure detection therefore plays an important role in recovering the system and in improving its availability and performance. Traditional failure detectors simply equate a link fault with a node fault, which greatly impairs resource utilization, fault localization, and fast repair. We present DLFDA, a self-adaptive link-based failure detection agreement with an improved node fault detection algorithm, which can accurately distinguish node faults from link faults. DLFDA dynamically adjusts its detection structure to increase the coverage of link fault detection, and uses a Gossip protocol to distribute fault diagnosis results to other system members, which substantially reduces the impact on system performance. Finally, experimental results show that our method meets the requirements of the theoretical design.
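
The abstract does not spell out how DLFDA separates the two fault types, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a SWIM-style indirect probe: when a direct probe fails, the observer asks helper peers to probe the target on its behalf. The names `Diagnosis`, `diagnose`, and the `reachable` set of (source, destination) pairs are hypothetical and stand in for real network probes.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "healthy"
    LINK_FAULT = "link_fault"
    NODE_FAULT = "node_fault"


def diagnose(observer, target, helpers, reachable):
    """Classify why `observer` cannot reach `target`.

    `reachable` is a hypothetical set of (src, dst) pairs that can
    currently communicate; it simulates the outcome of real probes.
    """
    def can_reach(src, dst):
        return (src, dst) in reachable

    # Direct probe succeeded: nothing to diagnose.
    if can_reach(observer, target):
        return Diagnosis.HEALTHY

    # Direct probe failed: ask each helper to probe the target.
    # If any helper reaches it, the target is alive and only the
    # observer-to-target link is at fault.
    for helper in helpers:
        if can_reach(observer, helper) and can_reach(helper, target):
            return Diagnosis.LINK_FAULT

    # No path found through any helper: suspect the node itself.
    return Diagnosis.NODE_FAULT


if __name__ == "__main__":
    links = {("a", "c"), ("c", "b")}  # a cannot reach b directly
    print(diagnose("a", "b", ["c"], links))
```

In this sketch a node-fault verdict is only a suspicion: if no helper is reachable from the observer, the target cannot be told apart from a partitioned but healthy node, which is one reason a detector might want to adjust which peers probe which links.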



Acknowledgments

This work is supported by the National High Technology Research 863 Major Program of China (No. 2011AA01A207).

Author information

Authors and Affiliations

College of Computer Science, Zhejiang University, Hangzhou, 310027, China
Yanzhang He, Xiaohong Jiang, Changbo Dai & Zikun Fan


Corresponding author

Correspondence to Xiaohong Jiang.

Editor information

Editors and Affiliations

National University of Defense Technology, Changsha, China
Yong Dou
Delft University of Technology, Delft, The Netherlands
Haixiang Lin
Peking University, Beijing, China
Guangyu Sun
National University of Defense Technology, Changsha, China
Junjie Wu
CiTIUS, Santiago de Compostela, Spain
Dora Heras
ENS Rennes, Rennes, France
Luc Bougé

About this paper

Cite this paper

He, Y., Jiang, X., Dai, C., Fan, Z. (2017). Self-adaptive Failure Detector for Peer-to-Peer Distributed System Considering the Link Faults. In: Dou, Y., Lin, H., Sun, G., Wu, J., Heras, D., Bougé, L. (eds) Advanced Parallel Processing Technologies. APPT 2017. Lecture Notes in Computer Science, vol 10561. Springer, Cham. https://doi.org/10.1007/978-3-319-67952-5_6

DOI: https://doi.org/10.1007/978-3-319-67952-5_6
Published: 14 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67951-8
Online ISBN: 978-3-319-67952-5
eBook Packages: Computer Science, Computer Science (R0)
