Abstract
Dependable computer architectures used in critical embedded real-time applications have successfully employed Byzantine resilience techniques to tolerate physical, internal, operational faults. The dominant cause of failure of a correctly designed Byzantine resilient computer today is the common-mode failure, i.e., the nearly simultaneous failure of multiple redundant copies, generally due to a single cause. Unlike independent hardware faults, for which theoretically rigorous fault tolerance solutions have been implemented, the sources of common-mode failures are so diverse that numerous disparate techniques are required to predict, avoid, remove, and tolerate them.
This paper describes the technical approach being used to reduce the probability of common-mode failure in the Draper Fault Tolerant Parallel Processor (FTPP), which has been designed for critical embedded real-time applications. It begins by placing common-mode failures in the context of overall impairments to dependability, to clarify their importance relative to other failure sources. The FTPP's approach to tolerating independent hardware faults is briefly motivated and described. The overall strategy for common-mode failure reduction comprises three major areas: common-mode failure avoidance, removal, and tolerance. For fault avoidance, a novel integrated formal methods and VHDL design methodology has been developed and applied. Common-mode fault tolerance techniques include a combination of on-line checking of the timing and functional behavior of operating system and application tasks, use of a formally verified system diagnosis processor to diagnose overall system health, and system-wide recovery actions. Techniques for reducing the probability of common-mode failure due to performance timing faults are also discussed.
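The distinction the abstract draws between independent faults and common-mode failures can be illustrated with a minimal sketch. It is not the FTPP's mechanism: a plain majority voter stands in for Byzantine resilient exchange and voting, and the function name `majority_vote`, the choice of four copies, and the values 42 and 17 are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of copies, else None.

    Hypothetical helper: a simplified stand-in for redundant-copy voting.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


correct = 42  # assumed correct result, chosen arbitrarily

# Independent fault: one of four redundant copies produces a wrong value.
independent = [correct, correct, 17, correct]

# Common-mode failure: a single cause corrupts three copies identically.
common_mode = [17, 17, 17, correct]

print(majority_vote(independent))  # 42: the single faulty copy is outvoted
print(majority_vote(common_mode))  # 17: the voter agrees on the wrong value
```

In the second case redundancy alone offers no protection, because the faulty copies form the majority. This is one way to see why the abstract treats avoidance, removal, and tolerance of common-mode failures as a separate problem from tolerating independent hardware faults.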
This work was supported by NASA Langley Research Center under contract NAS1-18565.
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lala, J.H., Harper, R.E. (1994). Fault tolerance in embedded real-time systems: Importance and treatment of common mode failures. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020040
DOI: https://doi.org/10.1007/BFb0020040
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive