Affordable fault tolerance through adaptation

  • Workshop on Fault-Tolerant Parallel and Distributed Systems: Dimiter Avresky, Boston University; David R. Kaeli, Northeastern University
  • Conference paper
Parallel and Distributed Processing (IPPS 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1388)

Abstract

Fault-tolerant programs are typically not only difficult to implement but also incur extra costs in terms of performance or resource consumption. Failures are typically relatively rare, but the fault-tolerance overhead must be paid regardless of whether any failures occur during program execution. This paper presents an approach that reduces the cost of fault tolerance, namely, adaptation to a change in failure model. In particular, a program that assumes no failures (or only benign failures) is combined with a component that is responsible for detecting whether failures occur and then switching to a fault-tolerant algorithm. Provided that the detection and adaptation mechanisms are not too expensive, this approach results in a program with smaller fault-tolerance overhead and thus better performance than a traditional fault-tolerant program. Thus, the high cost of fault tolerance is paid only when failures actually occur.
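The trade-off described in the abstract can be illustrated with a small cost model. The following is only a sketch under stated assumptions, not the authors' mechanism: the function names and all cost constants are hypothetical, each request is assumed to have a fixed cost, and the detector is assumed to stop running once the program has switched to the fault-tolerant algorithm.

```python
from typing import Optional

# Illustrative cost units only; these are assumptions, not measurements.
FAST_COST = 1    # per-request cost of the algorithm that assumes no failures
DETECT_COST = 1  # per-request cost of running the failure detector
ADAPT_COST = 3   # one-time cost of switching algorithms
FT_COST = 5      # per-request cost of the fault-tolerant algorithm


def adaptive_cost(n_requests: int, failure_at: Optional[int]) -> int:
    """Total cost of the adaptive program over n_requests requests.

    failure_at is the index of the first request at which a failure is
    present (None means no failure ever occurs).
    """
    adapted = False
    total = 0
    for i in range(n_requests):
        if not adapted:
            # The detector runs only while the cheap algorithm is in use.
            total += DETECT_COST
            if failure_at is not None and i >= failure_at:
                # Failure detected: pay the one-time switch cost.
                total += ADAPT_COST
                adapted = True
        total += FT_COST if adapted else FAST_COST
    return total


def always_fault_tolerant_cost(n_requests: int) -> int:
    """Total cost of a traditional program that is always fault tolerant."""
    return n_requests * FT_COST


if __name__ == "__main__":
    n = 10
    print("always fault tolerant:", always_fault_tolerant_cost(n))
    print("adaptive, no failure: ", adaptive_cost(n, None))
    print("adaptive, failure at 3:", adaptive_cost(n, 3))
    print("adaptive, failure at 0:", adaptive_cost(n, 0))
```

With these made-up numbers the adaptive program is cheaper when no failure occurs or when the failure arrives late, and slightly more expensive when the failure is present from the first request. This matches the abstract's proviso that the detection and adaptation mechanisms must not be too expensive.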

This work was supported in part by the National Science Foundation under grant CCR-9633336 and the Defense Advanced Research Projects Agency under grants F30602-96-1-0342 and N6600197-C-8518.



Editor information

José Rolim

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chang, I., Hiltunen, M.A., Schlichting, R.D. (1998). Affordable fault tolerance through adaptation. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_730

  • DOI: https://doi.org/10.1007/3-540-64359-1_730

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64359-3

  • Online ISBN: 978-3-540-69756-5

  • eBook Packages: Springer Book Archive
