Fault tolerance in embedded real-time systems: Importance and treatment of common mode failures

  • Embedded and Real-Time Systems
  • Conference paper
Hardware and Software Architectures for Fault Tolerance (Fault Tolerance 1993)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 774))

Abstract

Dependable computer architectures used in critical embedded real-time applications have successfully employed Byzantine resilience techniques to tolerate physical, internal, operational faults. The dominant cause of failure of a correctly designed Byzantine resilient computer today is the common-mode failure, i.e., the nearly simultaneous failure of multiple redundant copies, generally due to a single cause. Unlike independent hardware faults, for which theoretically rigorous fault tolerance solutions have been implemented, the sources of common-mode failures are so diverse that numerous disparate techniques are required to predict, avoid, remove, and tolerate them.
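The distinction between independent faults and common-mode failures can be illustrated with a minimal sketch. It assumes a simple exact-match majority voter over redundant copies, which is only a simplification of Byzantine resilience and is not the FTPP's actual mechanism; the function name `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of copies, or None.

    Illustrative only: a real Byzantine resilient system also needs
    sufficient replication and interactive exchange of values, which
    this sketch does not model.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


CORRECT = 42  # hypothetical correct result of the computation

# One copy fails independently: the voter masks the fault.
independent_fault = [42, 42, 7, 42]

# All copies fail the same way from a single cause (common mode):
# the voter sees unanimous agreement on a wrong value.
common_mode_failure = [7, 7, 7, 7]

masked = majority_vote(independent_fault)
fooled = majority_vote(common_mode_failure)

print("independent fault masked:", masked == CORRECT)
print("common-mode failure masked:", fooled == CORRECT)
```

Under these assumptions the voter recovers the correct value in the first case but agrees on the wrong value in the second, which is why redundancy alone does not address common-mode failures.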

This paper describes the technical approach being used to reduce the probability of common-mode failure in the Draper Fault Tolerant Parallel Processor (FTPP), which has been designed for critical embedded real-time applications. It begins by placing common-mode failures in the context of overall impairments to dependability, to clarify their importance relative to other failure sources. The FTPP's approach to tolerating independent hardware faults is then briefly motivated and described. The overall strategy for common-mode failure reduction comprises three major areas: common-mode failure avoidance, removal, and tolerance. For fault avoidance, a novel design methodology that integrates formal methods with VHDL has been developed and applied. Common-mode fault tolerance techniques combine on-line checking of the timing and functional behavior of operating system and application tasks, use of a formally verified system diagnosis processor to diagnose overall system health, and system-wide recovery actions. Techniques for reducing the probability of common-mode failure due to performance timing faults are also discussed.
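The idea of on-line checking of timing and functional behavior can be sketched as follows. This is an illustration under stated assumptions, not the FTPP's implementation: the helper `run_checked`, its parameters, and the plausibility predicate are all hypothetical, and a real system would enforce deadlines preemptively rather than measuring after the task returns.

```python
import time


def run_checked(task, deadline_s, is_plausible, clock=time.monotonic):
    """Run task() once and report a timing check and a functional check.

    task         -- hypothetical callable standing in for an application task
    deadline_s   -- assumed per-invocation time budget, in seconds
    is_plausible -- assumed acceptance predicate on the task's result
    clock        -- time source; injectable so the check can be tested
    """
    start = clock()
    result = task()
    elapsed = clock() - start
    return {
        "result": result,
        "timing_ok": elapsed <= deadline_s,
        "functional_ok": is_plausible(result),
    }


# Usage: a task that returns an out-of-range value within its deadline.
report = run_checked(
    task=lambda: 250.0,
    deadline_s=0.5,
    is_plausible=lambda v: 0.0 <= v <= 100.0,
)
print(report["functional_ok"])
```

A report with either check failing would be the kind of evidence a system-level diagnosis function could use to trigger recovery; how the FTPP actually does this is described in the paper itself, not here.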

This work was supported by NASA Langley Research Center under contract NAS1-18565.

Editor information

Michel Banâtre, Peter A. Lee

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lala, J.H., Harper, R.E. (1994). Fault tolerance in embedded real-time systems: Importance and treatment of common mode failures. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020040

  • DOI: https://doi.org/10.1007/BFb0020040

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57767-6

  • Online ISBN: 978-3-540-48330-4

  • eBook Packages: Springer Book Archive
