Two techniques for transient software error recovery

  • Software Architectures for Fault Tolerance
  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 774)

Abstract

The traditional approaches to software fault tolerance, the recovery block approach and N-version programming, are too expensive and consequently of limited practical use. Experience has shown that techniques such as rollback and retry, which do not employ multiple versions of the software, can mask a range of software faults whose failures are transient. These techniques are cost-effective because they do not rely on design diversity to support fault tolerance. In this report we discuss two such techniques that can be used to enhance the reliability of software systems.
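As a rough illustration of the general rollback-and-retry idea mentioned in the abstract, the following Python sketch saves a checkpoint of the program state, runs an operation, and on a failure restores the checkpoint and re-executes. It is not taken from the paper and does not represent either of the two techniques the authors describe; the names `TransientError`, `run_with_rollback`, and `make_flaky` are hypothetical and exist only for this example.

```python
import copy


class TransientError(Exception):
    """Hypothetical stand-in for a failure that may not recur on re-execution."""


def run_with_rollback(state, operation, max_retries=3):
    """Checkpoint `state`, run `operation(state)`, roll back and retry on failure.

    Returns (result, attempts_used). Raises RuntimeError if every attempt fails.
    """
    checkpoint = copy.deepcopy(state)  # recovery point taken before execution
    for attempt in range(1, max_retries + 1):
        try:
            return operation(state), attempt
        except TransientError:
            # Roll back: discard any partial updates made by the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("failure persisted across all retries")


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds (simulation only)."""
    remaining = {"n": failures}

    def op(state):
        state["balance"] -= 10  # partial update made before the failure point
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientError("simulated transient failure")
        return state["balance"]

    return op


account = {"balance": 100}
result, attempts = run_with_rollback(account, make_flaky(2))
```

Here the simulated operation fails twice and succeeds on the third attempt. Because each failed attempt is rolled back, the partial debit is not applied more than once. A fault that recurs on every re-execution is not masked by this scheme, which is why the sketch gives up after a bounded number of retries.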

Editor information

Michel Banâtre, Peter A. Lee

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, Y., Jalote, P., Kintala, C. (1994). Two techniques for transient software error recovery. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020031

  • DOI: https://doi.org/10.1007/BFb0020031

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57767-6

  • Online ISBN: 978-3-540-48330-4

  • eBook Packages: Springer Book Archive
