Abstract
In this paper, we present a fault-tolerance and recovery system called FRASystem (Fault Tolerant & Recovery Agent System) that uses multiple agents in distributed computing systems. Previous rollback-recovery protocols depended on the inherent communication mechanism and the underlying operating system, which degraded computing performance. We propose a rollback-recovery protocol that works independently of the operating system and thereby improves portability and extensibility. We define four types of agents: (1) a recovery agent, which performs the rollback-recovery protocol after a failure; (2) an information agent, which builds domain knowledge in the form of fault-tolerance rules and information during failure-free operation; (3) a facilitator agent, which controls communication between agents; and (4) a garbage collection agent, which discards fault-tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we also propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the growth of fault-tolerance information saved in stable storage. We implemented a prototype of FRASystem using Java and CORBA and evaluated the proposed rollback-recovery protocol experimentally. The simulation results indicate that our protocol performs better than previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem does not depend on the operating system, and (3) FRASystem provides portability and extensibility.
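To make the baseline techniques named above concrete, the following is a minimal Java sketch of independent checkpointing combined with pessimistic message logging, with log entries discarded once a checkpoint covers them. It is an illustration under stated assumptions and not the FRASystem implementation: the class names (`RollbackSketch`, `StableStorage`, `Process`), the in-memory stand-in for stable storage, and the integer "running sum" process state are all hypothetical, and no agents or CORBA communication are modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; names and structure are assumptions, not FRASystem code. */
public class RollbackSketch {

    /** In-memory stand-in for stable storage; a real system would persist this data. */
    static final class StableStorage {
        int checkpointState = 0;                      // process state saved at the last checkpoint
        int checkpointSeq = 0;                        // messages already reflected in that checkpoint
        final List<Integer> log = new ArrayList<>();  // messages logged since the last checkpoint
    }

    /** A process whose state is the running sum of the messages it has delivered. */
    static final class Process {
        final StableStorage storage;
        int state = 0;
        int delivered = 0;

        Process(StableStorage storage) {
            this.storage = storage;
        }

        /** Pessimistic logging: the message is logged before it is delivered. */
        void receive(int message) {
            storage.log.add(message);
            state += message;
            delivered++;
        }

        /** Independent checkpoint: taken locally, without coordinating with other processes. */
        void checkpoint() {
            storage.checkpointState = state;
            storage.checkpointSeq = delivered;
            // Garbage collection: entries covered by the new checkpoint are no longer needed.
            storage.log.clear();
        }

        /** Recovery: restore the last checkpoint, then replay the logged messages in order. */
        static Process recover(StableStorage storage) {
            Process p = new Process(storage);
            p.state = storage.checkpointState;
            p.delivered = storage.checkpointSeq;
            for (int message : storage.log) {
                p.state += message;  // replay without logging the message again
                p.delivered++;
            }
            return p;
        }
    }

    public static void main(String[] args) {
        StableStorage storage = new StableStorage();
        Process p = new Process(storage);
        p.receive(5);
        p.checkpoint();
        p.receive(7);
        p.receive(3);
        // Simulate a crash: the in-memory process is lost and only stable storage survives.
        Process recovered = Process.recover(storage);
        System.out.println(recovered.state); // prints 15
    }
}
```

Because every message is logged before delivery, the recovered process reaches the same state it had before the crash without forcing any other process to roll back, which is how this style of logging avoids a domino effect. The cost is a stable-storage write on every receive.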
Additional information
This work was supported by the Soonchunhyang University Research Fund 20080152.
About this article
Cite this article
Lee, H., Park, D., Yu, H. et al. FRASystem: fault tolerant system using agents in distributed computing systems. Cluster Comput 14, 15–25 (2011). https://doi.org/10.1007/s10586-009-0095-x
DOI: https://doi.org/10.1007/s10586-009-0095-x