FRASystem: fault tolerant system using agents in distributed computing systems

Lee, HwaMin; Park, DooSoon; Yu, HeonChang; Lee, Giyeol

doi:10.1007/s10586-009-0095-x

Published: 17 July 2009

Cluster Computing, Volume 14, pages 15–25 (2011)

HwaMin Lee¹, DooSoon Park¹, HeonChang Yu² & Giyeol Lee³

Abstract

In this paper, we present FRASystem (Fault Tolerant & Recovery Agent System), a fault tolerance and recovery system for distributed computing systems that is built on multiple agents. Previous rollback-recovery protocols depended on the inherent communication mechanism and the underlying operating system, which degraded computing performance. We propose a rollback-recovery protocol that works independently of the operating system, which increases portability and extensibility. We define four types of agents: (1) a recovery agent, which performs the rollback-recovery protocol after a failure; (2) an information agent, which constructs domain knowledge, in the form of fault tolerance rules and information, during failure-free operation; (3) a facilitator agent, which controls communication between agents; and (4) a garbage collection agent, which discards fault tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the growth of fault tolerance information saved in stable storage. We implemented a prototype of FRASystem using Java and CORBA and evaluated the proposed rollback-recovery protocol experimentally. The simulation results indicate that our protocol performs better than previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem does not depend on an operating system, and (3) FRASystem provides portability and extensibility.
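The division of labor among the four agent types can be illustrated with a small sketch. This is not the authors' Java/CORBA prototype: every class, method, and message name below is hypothetical, stable storage is replaced by an in-memory object, and the logging discipline shown (log each message before delivery, checkpoint independently, replay after a failure) is the generic independent-checkpointing and pessimistic-logging scheme named in the abstract as the baseline, used here only as an assumption about what the agents might manage.

```python
from dataclasses import dataclass, field


@dataclass
class StableStorage:
    """In-memory stand-in for stable storage (hypothetical)."""
    checkpoints: dict = field(default_factory=dict)  # pid -> (seq, state)
    message_log: dict = field(default_factory=dict)  # pid -> [(seq, message)]


class InformationAgent:
    """Saves fault tolerance information during failure-free operation."""

    def __init__(self, storage):
        self.storage = storage

    def log_message(self, pid, seq, message):
        # Assumed pessimistic logging: save the message before delivery.
        self.storage.message_log.setdefault(pid, []).append((seq, message))

    def checkpoint(self, pid, seq, state):
        # Independent checkpoint: each process saves its own state.
        self.storage.checkpoints[pid] = (seq, state)


class RecoveryAgent:
    """Rolls a failed process back and replays its logged messages."""

    def __init__(self, storage):
        self.storage = storage

    def recover(self, pid, apply, initial_state):
        seq, state = self.storage.checkpoints.get(pid, (0, initial_state))
        for msg_seq, message in self.storage.message_log.get(pid, []):
            if msg_seq > seq:  # replay only messages after the checkpoint
                state = apply(state, message)
        return state


class GarbageCollectionAgent:
    """Discards log entries already covered by the latest checkpoint."""

    def __init__(self, storage):
        self.storage = storage

    def collect(self, pid):
        seq, _ = self.storage.checkpoints.get(pid, (0, None))
        log = self.storage.message_log.get(pid, [])
        kept = [(s, m) for s, m in log if s > seq]
        self.storage.message_log[pid] = kept
        return len(log) - len(kept)


class FacilitatorAgent:
    """Routes named requests to whichever agent registered for them."""

    def __init__(self):
        self.handlers = {}

    def register(self, name, handler):
        self.handlers[name] = handler

    def send(self, name, *args):
        return self.handlers[name](*args)


def run_demo():
    storage = StableStorage()
    info = InformationAgent(storage)
    facilitator = FacilitatorAgent()
    facilitator.register("log", info.log_message)
    facilitator.register("checkpoint", info.checkpoint)
    facilitator.register("recover", RecoveryAgent(storage).recover)
    facilitator.register("collect", GarbageCollectionAgent(storage).collect)

    state = 0  # toy process state: a running sum of received integers
    for seq, message in enumerate([5, 7, 11, 13], start=1):
        facilitator.send("log", "p1", seq, message)
        state += message
        if seq == 2:
            facilitator.send("checkpoint", "p1", seq, state)

    removed = facilitator.send("collect", "p1")
    # Simulate a crash of p1: rebuild its state from stable storage only.
    recovered = facilitator.send("recover", "p1", lambda s, m: s + m, 0)
    return state, removed, recovered
```

In this toy run the process state is a running sum, so the recovered state can be compared directly with the state held before the simulated crash. Routing every request through the facilitator keeps the process code unaware of which agent handles logging, recovery, or garbage collection, which is one plausible reading of the portability claim in the abstract.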

Author information

Authors and Affiliations

Division of Computer Science and Engineering, Soonchunhyang University, Asan-si, 336-745, Korea
HwaMin Lee & DooSoon Park
Dept. of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea
HeonChang Yu
Research and Development Center, Saman Corporation, Anyang, 431-050, Korea
Giyeol Lee

Corresponding author

Correspondence to HwaMin Lee.

Additional information

This work was supported by the Soonchunhyang University Research Fund 20080152.

About this article

Cite this article

Lee, H., Park, D., Yu, H. et al. FRASystem: fault tolerant system using agents in distributed computing systems. Cluster Comput 14, 15–25 (2011). https://doi.org/10.1007/s10586-009-0095-x

Received: 24 November 2008
Accepted: 25 June 2009
Published: 17 July 2009
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10586-009-0095-x
