Abstract
Distributed computing is now prevalent in artificial intelligence applications because the computation capacity of a single computing node is limited. A distributed computing system generally contains a large number of computing nodes, so failures are a routine occurrence. Failure detection therefore plays an important role in recovering the system and in improving its availability and performance. Traditional failure detectors simply equate a link fault with a node fault, which hurts resource utilization, fault localization, and fast repair. We present a self-adaptive Link-based Failure Detection Agreement (DLFDA) with an improved node fault detection algorithm that can accurately distinguish node faults from link faults. DLFDA dynamically adjusts its detection structure to increase the coverage of link fault detection and uses a Gossip protocol to distribute fault diagnosis results to the other system members, which greatly reduces the damage to system performance. Finally, experimental results show that our method meets the requirements of the theoretical design.
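To make the node-versus-link distinction concrete, the following is a minimal sketch and not the DLFDA algorithm itself, whose details are not given in this abstract. It assumes a common indirect-probing idea: if an observer cannot reach a target directly but some helper peer can, the fault is attributed to the link rather than the node. The names `classify_fault`, `make_probe`, and `merge_views`, the status strings, and the versioned diagnosis table are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Set, Tuple, FrozenSet

# Hypothetical probe: reachable(src, dst) is True if src gets a heartbeat reply from dst.
Probe = Callable[[str, str], bool]


def make_probe(down_nodes: Set[str], broken_links: Set[FrozenSet[str]]) -> Probe:
    """Build a toy in-memory probe from a set of crashed nodes and broken links."""
    def reachable(src: str, dst: str) -> bool:
        if src in down_nodes or dst in down_nodes:
            return False
        return frozenset((src, dst)) not in broken_links
    return reachable


def classify_fault(observer: str, target: str,
                   helpers: Iterable[str], reachable: Probe) -> str:
    """Return 'ok', 'link_fault', or 'node_fault' (a suspicion, not a proof)."""
    if reachable(observer, target):
        return "ok"
    # Direct probe failed: ask helpers to probe the target on our behalf.
    for helper in helpers:
        if reachable(observer, helper) and reachable(helper, target):
            return "link_fault"  # target is alive, so the observer-target link is at fault
    # No helper could reach the target either; several broken links could
    # look the same, so this is only a suspected node fault.
    return "node_fault"


Diagnosis = Tuple[int, str]  # (version, status); assumed format for this sketch


def merge_views(local: Dict[str, Diagnosis],
                remote: Dict[str, Diagnosis]) -> Dict[str, Diagnosis]:
    """Gossip-style merge: per subject, keep the diagnosis with the higher version."""
    merged = dict(local)
    for subject, (version, status) in remote.items():
        if subject not in merged or version > merged[subject][0]:
            merged[subject] = (version, status)
    return merged


if __name__ == "__main__":
    probe = make_probe(down_nodes={"c"}, broken_links={frozenset(("a", "b"))})
    print(classify_fault("a", "b", ["d"], probe))       # link_fault
    print(classify_fault("a", "c", ["b", "d"], probe))  # node_fault
```

In this sketch a gossip round would amount to a node sending its diagnosis table to a peer, which then applies `merge_views`. How DLFDA selects peers and adapts its detection structure is not specified here and is left out.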
Acknowledgments
This work was supported by the National High Technology Research 863 Major Program of China (No. 2011AA01A207).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
He, Y., Jiang, X., Dai, C., Fan, Z. (2017). Self-adaptive Failure Detector for Peer-to-Peer Distributed System Considering the Link Faults. In: Dou, Y., Lin, H., Sun, G., Wu, J., Heras, D., Bougé, L. (eds) Advanced Parallel Processing Technologies. APPT 2017. Lecture Notes in Computer Science, vol 10561. Springer, Cham. https://doi.org/10.1007/978-3-319-67952-5_6
DOI: https://doi.org/10.1007/978-3-319-67952-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67951-8
Online ISBN: 978-3-319-67952-5
eBook Packages: Computer Science, Computer Science (R0)