Approaches for System-Level Fault Tolerance in Distributed Real-Time Computer Systems

Kim, K. H.

doi:10.1007/978-3-642-75002-1_22

Conference paper

Part of the book series: Informatik-Fachberichte (INFORMATIK, volume 214)

Abstract

The purpose of this paper is to summarize the major issues in providing the capability to tolerate both hardware faults and software faults in distributed real-time computer systems (DCSs). The paper starts with several guidelines considered highly useful in searching for effective system-level fault tolerance schemes. Some promising schemes are then reviewed.

Author information

Authors and Affiliations

Computer Engineering Program, Dept. of Electrical Engineering, University of California, Irvine, CA, 92717, USA
K. H. Kim

Editor information

Editors and Affiliations

Institut für Rechnerentwurf und Fehlertoleranz, Fakultät für Informatik, Universität Karlsruhe, Postfach 6980, D-7500, Karlsruhe 1, Germany
Winfried Görke
ZBIT/ Elektro- und Prozeßleittechnik, Th. Goldschmidt AG, Goldschmidtstraße 100, D-4300, Essen 1, Germany
Holger Sörensen

About this paper

Cite this paper

Kim, K.H. (1989). Approaches for System-Level Fault Tolerance in Distributed Real-Time Computer Systems. In: Görke, W., Sörensen, H. (eds) Fehlertolerierende Rechensysteme / Fault-tolerant Computing Systems. Informatik-Fachberichte, vol 214. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-75002-1_22

DOI: https://doi.org/10.1007/978-3-642-75002-1_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-51565-4
Online ISBN: 978-3-642-75002-1
eBook Packages: Springer Book Archive