Fault tolerance in embedded real-time systems: Importance and treatment of common mode failures

Lala, Jaynarayan H.; Harper, Richard E.

doi:10.1007/BFb0020040


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 774)

Included in the following conference series:

Workshop on Fault Tolerance


Abstract

Dependable computer architectures used in critical embedded real-time applications have successfully employed Byzantine resilience techniques to tolerate physical, internal, operational faults. The dominant cause of failure of a correctly designed Byzantine resilient computer today is the common-mode failure, i.e., the nearly simultaneous failure of multiple redundant copies, generally due to a single cause. Unlike independent hardware faults, for which theoretically rigorous fault tolerance solutions have been implemented, the sources of common-mode failures are so diverse that numerous disparate techniques are required to predict, avoid, remove, and tolerate them.
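
The distinction between independent faults and common-mode failures can be illustrated with a simple majority voter. The sketch below is only an illustration and is not taken from the paper: the function name `majority_vote` and the sample values are hypothetical, and a plain majority vote stands in for the far more elaborate exchange and voting of a Byzantine resilient design.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value held by a strict majority of redundant copies.

    Returns None when the list is empty or no value has a strict majority.
    Hypothetical helper for illustration only.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# Independent fault: one of four copies is wrong, and the vote masks it.
independent = majority_vote([42, 42, 17, 42])

# Common-mode failure: a single cause corrupts three copies identically,
# so the voter selects the wrong value without detecting any disagreement.
common_mode = majority_vote([17, 17, 17, 42])
```

In the second call the redundancy gives no protection, which is why the abstract treats common-mode failures as a separate problem from independent hardware faults.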

This paper describes the technical approach being used to reduce the probability of common-mode failure in the Draper Fault Tolerant Parallel Processor (FTPP), which has been designed for critical embedded real-time applications. It begins by placing common-mode failures in the context of overall impairments to dependability, to clarify their importance relative to other failure sources. The FTPP's approach to tolerating independent hardware faults is briefly motivated and described. The overall strategy for common-mode failure reduction comprises three major areas: common-mode failure avoidance, removal, and tolerance. For fault avoidance, a novel design methodology that integrates formal methods with VHDL has been developed and applied. Common-mode fault tolerance techniques combine on-line checking of the timing and functional behavior of operating system and application tasks, use of a formally verified system diagnosis processor to diagnose overall system health, and system-wide recovery actions. Techniques for reducing the probability of common-mode failure due to performance timing faults are also discussed.
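
As a rough illustration of what on-line checking of timing and functional behavior might look like, the sketch below wraps a task with a deadline check and a plausibility check on its result. It is an assumption-laden example, not the FTPP mechanism: `run_checked`, `deadline_s`, and `is_plausible` are hypothetical names, and the timing check runs only after the task returns, so it cannot interrupt a task that never finishes.

```python
import time


def run_checked(task, deadline_s, is_plausible):
    """Run task() and classify the outcome (hypothetical helper).

    Returns a (status, result) pair where status is "timing_fault" if the
    task overran deadline_s, "functional_fault" if is_plausible(result) is
    false, and "ok" otherwise. The deadline is checked after completion.
    """
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        return ("timing_fault", result)
    if not is_plausible(result):
        return ("functional_fault", result)
    return ("ok", result)


def in_range(value):
    """Example acceptance test: the output must lie within 0..10."""
    return 0 <= value <= 10


def slow_task():
    """Example task that overruns a 10 ms deadline."""
    time.sleep(0.05)
    return 5


ok_status = run_checked(lambda: 5, 1.0, in_range)
bad_value = run_checked(lambda: 50, 1.0, in_range)
too_slow = run_checked(slow_task, 0.01, in_range)
```

A detected violation would then feed whatever diagnosis and recovery actions the system defines; those steps are outside the scope of this sketch.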

This work was supported by NASA Langley Research Center under contract NAS1-18565.



Author information

Authors and Affiliations

The Charles Stark Draper Laboratory, Advanced Computer Architectures Group, 555 Technology Square, MS 73, Cambridge, MA 02139
Jaynarayan H. Lala & Richard E. Harper


Editor information

Michel Banâtre, Peter A. Lee


About this paper

Cite this paper

Lala, J.H., Harper, R.E. (1994). Fault tolerance in embedded real-time systems: Importance and treatment of common mode failures. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020040

Published: 10 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive
