
Failure Detection Service for Large Scale Systems

  • Conference paper
Agent and Multi-Agent Systems: Technologies and Applications (KES-AMSTA 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4496)

Abstract

This paper addresses the problem of building a failure detection service for large-scale distributed systems, as well as for multi-agent systems. It describes the failure detector mechanism and defines the roles it plays in the system. It then presents the key construction problems that are fundamental to building a failure detection service. Finally, it sketches a general framework for implementing such a service. The proposed failure detection service can be used by mobile agents as a crucial component for building fault-tolerant multi-agent systems.

Editor information

Ngoc Thanh Nguyen, Adam Grzech, Robert J. Howlett, Lakhmi C. Jain

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kobusiński, J. (2007). Failure Detection Service for Large Scale Systems. In: Nguyen, N.T., Grzech, A., Howlett, R.J., Jain, L.C. (eds) Agent and Multi-Agent Systems: Technologies and Applications. KES-AMSTA 2007. Lecture Notes in Computer Science, vol 4496. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72830-6_70

  • DOI: https://doi.org/10.1007/978-3-540-72830-6_70

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72829-0

  • Online ISBN: 978-3-540-72830-6

  • eBook Packages: Computer Science, Computer Science (R0)
