Approaches for System-Level Fault Tolerance in Distributed Real-Time Computer Systems

  • Conference paper

Part of the book series: Informatik-Fachberichte (INFORMATIK, volume 214)

Abstract

This paper summarizes the major issues in providing tolerance of both hardware faults and software faults in distributed real-time computer systems (DCSs). It begins with several guidelines considered highly useful in the search for effective system-level fault tolerance schemes, and then reviews some promising schemes.

Copyright information

© 1989 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, K.H. (1989). Approaches for System-Level Fault Tolerance in Distributed Real-Time Computer Systems. In: Görke, W., Sörensen, H. (eds) Fehlertolerierende Rechensysteme / Fault-tolerant Computing Systems. Informatik-Fachberichte, vol 214. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-75002-1_22

  • DOI: https://doi.org/10.1007/978-3-642-75002-1_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-51565-4

  • Online ISBN: 978-3-642-75002-1

  • eBook Packages: Springer Book Archive
