FRASystem: fault tolerant system using agents in distributed computing systems

Cluster Computing

Abstract

In this paper, we present FRASystem (Fault Tolerant & Recovery Agent System), a fault tolerance and recovery system that uses multiple agents in distributed computing systems. Previous rollback-recovery protocols depended on the inherent communication mechanism and the underlying operating system, which degraded computing performance. We propose a rollback-recovery protocol that works independently of the operating system, which increases portability and extensibility. We define four types of agents: (1) a recovery agent performs the rollback-recovery protocol after a failure, (2) an information agent constructs domain knowledge, as fault tolerance rules and information, during failure-free operation, (3) a facilitator agent controls the communication between agents, and (4) a garbage collection agent collects fault tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the growth of fault tolerance information saved in stable storage. We implemented a prototype of FRASystem using Java and CORBA and evaluated the proposed rollback-recovery protocol experimentally. The simulation results indicate that our protocol performs better than previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem does not depend on the operating system, and (3) FRASystem provides portability and extensibility.
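For readers unfamiliar with the baseline techniques named in the abstract, the following is a minimal Java sketch of independent checkpointing combined with pessimistic message logging, plus garbage collection of log entries that a checkpoint has made obsolete. It is not FRASystem's implementation: the class `RecoverySketch`, the nested `LoggedProcess`, and all method names are hypothetical, stable storage is simulated with in-memory fields, and no agents or CORBA are involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names come from the paper. */
public class RecoverySketch {

    /** A process whose application state is the running sum of delivered messages. */
    public static final class LoggedProcess {
        // Volatile state: lost when the process crashes.
        private long state = 0;
        private int delivered = 0;

        // Simulated stable storage (assumption: these fields survive crash()).
        private long checkpointState = 0;
        private int checkpointDelivered = 0;
        private final List<Integer> messageLog = new ArrayList<>();
        private int logBase = 0; // sequence number of the first entry still in the log

        /** Pessimistic logging: the message is logged before it is delivered. */
        public void receive(int message) {
            messageLog.add(message);
            state += message;
            delivered++;
        }

        /** Independent checkpoint: saves local state without coordinating with peers. */
        public void checkpoint() {
            checkpointState = state;
            checkpointDelivered = delivered;
        }

        /** Drops log entries already covered by the latest checkpoint. */
        public int collectGarbage() {
            int removable = checkpointDelivered - logBase;
            messageLog.subList(0, removable).clear();
            logBase = checkpointDelivered;
            return removable;
        }

        /** Simulates a failure that wipes volatile state only. */
        public void crash() {
            state = 0;
            delivered = 0;
        }

        /** Rollback-recovery: restore the checkpoint, then replay logged messages. */
        public void recover() {
            state = checkpointState;
            delivered = checkpointDelivered;
            for (int i = checkpointDelivered - logBase; i < messageLog.size(); i++) {
                state += messageLog.get(i);
                delivered++;
            }
        }

        public long state() { return state; }
        public int logSize() { return messageLog.size(); }
    }

    public static void main(String[] args) {
        LoggedProcess p = new LoggedProcess();
        p.receive(1); p.receive(2); p.receive(3);
        p.checkpoint();
        p.receive(4); p.receive(5);
        p.collectGarbage();
        p.crash();
        p.recover();
        System.out.println("state after recovery: " + p.state());
    }
}
```

Because every message is logged before delivery, replaying the log on top of the last checkpoint reproduces the pre-failure state without rolling back any other process. The garbage collection step shows why the log can be truncated safely: entries older than the checkpoint are never replayed.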

Author information

Correspondence to HwaMin Lee.

Additional information

This work was supported by the Soonchunhyang University Research Fund 20080152.

About this article

Cite this article

Lee, H., Park, D., Yu, H. et al. FRASystem: fault tolerant system using agents in distributed computing systems. Cluster Comput 14, 15–25 (2011). https://doi.org/10.1007/s10586-009-0095-x

  • DOI: https://doi.org/10.1007/s10586-009-0095-x