Affordable fault tolerance through adaptation

Chang, Ilwoo; Hiltunen, Matti A.; Schlichting, Richard D.

doi:10.1007/3-540-64359-1_730

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1388)

Included in the following conference series:

International Parallel Processing Symposium

Abstract

Fault-tolerant programs are typically not only difficult to implement but also incur extra costs in terms of performance or resource consumption. Failures are typically relatively rare, but the fault-tolerance overhead must be paid regardless of whether any failures occur during program execution. This paper presents an approach that reduces the cost of fault tolerance, namely, adaptation to a change in the failure model. In particular, a program that assumes no failures (or only benign failures) is combined with a component that is responsible for detecting whether failures occur and then switching to a fault-tolerant algorithm. Provided that the detection and adaptation mechanisms are not too expensive, this approach results in a program with smaller fault-tolerance overhead and thus better performance than a traditional fault-tolerant program. Thus, the high cost of fault tolerance is only paid when failures actually occur.
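
The structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `AdaptiveService`, the acceptance test `accept`, and the choice of majority voting as the fault-tolerant algorithm are all assumptions made for illustration. A cheap single-replica path is used until a simple detector rejects a result, after which every request goes through the more expensive voted path.

```python
from collections import Counter
from typing import Callable, List

Replica = Callable[[int], int]


def accept(x: int, r: int) -> bool:
    """Hypothetical detector: cheap acceptance test for an integer square root."""
    return r >= 0 and r * r <= x < (r + 1) * (r + 1)


class AdaptiveService:
    """Illustrative only: runs a cheap algorithm until a failure is detected."""

    def __init__(self, replicas: List[Replica], check: Callable[[int, int], bool]):
        self.replicas = replicas
        self.check = check
        self.fault_tolerant = False  # current failure model: no failures assumed
        self.calls = 0               # replica invocations, a rough proxy for cost

    def _cheap(self, x: int) -> int:
        # Failure-free algorithm: ask a single replica.
        self.calls += 1
        return self.replicas[0](x)

    def _voted(self, x: int) -> int:
        # Fault-tolerant algorithm (assumed here): majority vote over all replicas.
        self.calls += len(self.replicas)
        results = [replica(x) for replica in self.replicas]
        return Counter(results).most_common(1)[0][0]

    def request(self, x: int) -> int:
        if not self.fault_tolerant:
            result = self._cheap(x)
            if self.check(x, result):
                return result
            # Detection triggered: adapt to the stronger failure model.
            self.fault_tolerant = True
        return self._voted(x)
```

In this sketch the voting cost is paid only from the first detected failure onward. The saving depends on the detector being much cheaper than the fault-tolerant algorithm, which matches the proviso stated in the abstract.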

This work was supported in part by the National Science Foundation under grant CCR-9633336 and by the Defense Advanced Research Projects Agency under grants F30602-96-1-0342 and N6600197-C-8518.

Author information

Authors and Affiliations

Department of Computer Science, University of Arizona, 85712, Tucson, AZ
Ilwoo Chang, Matti A. Hiltunen & Richard D. Schlichting

Editor information

José Rolim

About this paper

Cite this paper

Chang, I., Hiltunen, M.A., Schlichting, R.D. (1998). Affordable fault tolerance through adaptation. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_730

DOI: https://doi.org/10.1007/3-540-64359-1_730
Published: 08 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64359-3
Online ISBN: 978-3-540-69756-5
eBook Packages: Springer Book Archive
