
Fault tolerant processes

Distributed Computing

Abstract

A process is said to be fault tolerant if the system provides proper service despite the failure of the process. For supporting fault-tolerant processes, measures have to be provided to recover messages lost due to the failure. One approach for recovering messages is to use message-logging techniques. In this paper, we present a model for message-logging based schemes to support fault-tolerant processes and develop conditions for proper message recovery in asynchronous systems. We show that requiring messages to be recovered in the same order as they were received before failure is a stricter requirement than necessary. We then propose a distributed scheme to support fault-tolerant processes that can also handle multiple process failures.
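As a rough illustration of the message-logging idea described in the abstract, the following minimal sketch shows a receiver that logs each message before processing it, and after a failure restores its last checkpoint and replays the log. It is not the scheme proposed in the paper: the class `LoggedProcess`, its methods, and the `demo` function are hypothetical names, the checkpoint and log are assumed to survive a crash (as if kept on stable storage), and the process is assumed to be deterministic. The log is replayed in receipt order, which is the conservative choice that the abstract argues is stricter than necessary.

```python
from dataclasses import dataclass, field


@dataclass
class LoggedProcess:
    """Hypothetical deterministic process; its state is a running total."""
    state: int = 0                           # volatile state, lost on a crash
    checkpoint: int = 0                      # assumed to be on stable storage
    log: list = field(default_factory=list)  # assumed to be on stable storage

    def receive(self, sender: str, value: int) -> None:
        # Log the message first, then apply it to the volatile state.
        self.log.append((sender, value))
        self.state += value

    def take_checkpoint(self) -> None:
        # Save the current state; earlier messages no longer need replaying.
        self.checkpoint = self.state
        self.log.clear()

    def crash(self) -> None:
        # Simulate a failure: only the volatile state is lost.
        self.state = 0

    def recover(self) -> None:
        # Restore the checkpoint, then replay logged messages in receipt order.
        self.state = self.checkpoint
        for _sender, value in self.log:
            self.state += value


def demo() -> tuple:
    p = LoggedProcess()
    p.receive("a", 3)
    p.take_checkpoint()
    p.receive("b", 4)
    p.receive("a", 5)
    before = p.state
    p.crash()
    p.recover()
    return before, p.state
```

Calling `demo()` returns the state before the simulated crash and the state after recovery; the two values match because every message received since the checkpoint was logged and replayed.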

Author information

Pankaj Jalote received the Bachelor of Technology degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 1980, the M.S. degree in computer science from Pennsylvania State University, University Park, in 1982, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1985. From August 1985 to July 1989 he was an Assistant Professor in the Department of Computer Science at the University of Maryland, College Park. Currently he is an Assistant Professor in the Department of Computer Science and Engineering at IIT Kanpur, India. His research interests include fault-tolerant computing, distributed systems, and software engineering.

This work was supported in part by NSF grant DCI-8610337.

About this article

Cite this article

Jalote, P. Fault tolerant processes. Distrib Comput 3, 187–195 (1989). https://doi.org/10.1007/BF01784887

  • DOI: https://doi.org/10.1007/BF01784887
