Reliable computing systems

  • Chapter 3: Issues And Results In The Design Of Operating Systems
Operating Systems

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 60))

Abstract

The paper analyses the problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and fault tolerance techniques. Topics covered include (i) differing types of reliability requirement, (ii) forms of protective redundancy in hardware and software systems, (iii) methods of structuring the activity of a system, using atomic actions, so as to limit information flow, (iv) error detection techniques, (v) strategies for locating and dealing with faults, and for assessing the damage they have caused, and (vi) forward and backward error recovery techniques, based on the concepts of recovery line, commitment, exception and compensation. A set of appendices provides summary descriptions and analyses of a number of computing systems that have been specifically designed to achieve very high reliability.

Editor information

R. Bayer R. M. Graham G. Seegmüller

Copyright information

© 1978 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Randell, B. (1978). Reliable computing systems. In: Bayer, R., Graham, R.M., Seegmüller, G. (eds) Operating Systems. Lecture Notes in Computer Science, vol 60. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-08755-9_8

  • DOI: https://doi.org/10.1007/3-540-08755-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-08755-7

  • Online ISBN: 978-3-540-35880-0

  • eBook Packages: Springer Book Archive