Abstract
The traditional approaches to software fault tolerance, recovery blocks and N-version programming, are too expensive and consequently of limited practical use. Experience has shown that techniques such as rollback and retry, which do not employ multiple versions of the software, can mask a range of software faults that manifest as transient failures. These techniques are cost-effective because they do not rely on design diversity to support fault tolerance. In this report we discuss two such techniques that can be used to enhance the reliability of software systems.
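As a generic illustration of the rollback-and-retry idea mentioned above (not the specific techniques of the report), the following minimal sketch checkpoints a state, runs an operation, and on failure restores the checkpoint before retrying. The names `TransientFault`, `run_with_rollback`, and `make_flaky_deposit` are hypothetical and introduced only for this example; the failure is simulated.

```python
import copy


class TransientFault(Exception):
    """Hypothetical error type standing in for a transient software failure."""


def run_with_rollback(state, operation, max_retries=3):
    """Checkpoint `state`, run `operation`, and roll back and retry on failure.

    Returns the number of the attempt that succeeded.
    """
    checkpoint = copy.deepcopy(state)  # recovery point taken before execution
    for attempt in range(1, max_retries + 1):
        try:
            operation(state)
            return attempt
        except TransientFault:
            # Rollback: discard partial updates and restore the checkpoint.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise TransientFault(f"still failing after {max_retries} attempts")


def make_flaky_deposit(failures):
    """Build an operation that fails `failures` times after a partial update."""
    remaining = {"n": failures}

    def deposit(state):
        state["balance"] += 100  # partial update happens before the fault
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientFault("simulated transient failure")
        state["committed"] = True

    return deposit


account = {"balance": 50, "committed": False}
attempts = run_with_rollback(account, make_flaky_deposit(failures=2))
print(attempts, account)  # 3 {'balance': 150, 'committed': True}
```

Only one version of the operation is ever executed; the retry succeeds because the simulated fault does not recur, which is the property that makes this approach cheaper than design diversity. A fault that recurs on every attempt would exhaust the retries and propagate.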
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, Y., Jalote, P., Kintala, C. (1994). Two techniques for transient software error recovery. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020031
DOI: https://doi.org/10.1007/BFb0020031
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive