Fault tolerant processes

Jalote, Pankaj

doi:10.1007/BF01784887

Published: December 1989

Volume 3, pages 187–195 (1989)
Distributed Computing

Abstract

A process is said to be fault tolerant if the system provides proper service despite the failure of the process. For supporting fault-tolerant processes, measures have to be provided to recover messages lost due to the failure. One approach for recovering messages is to use message-logging techniques. In this paper, we present a model for message-logging based schemes to support fault-tolerant processes and develop conditions for proper message recovery in asynchronous systems. We show that requiring messages to be recovered in the same order as they were received before failure is a stricter requirement than necessary. We then propose a distributed scheme to support fault-tolerant processes that can also handle multiple process failures.
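
As a rough illustration of the message-logging idea described above, the following minimal sketch logs each message before delivering it and rebuilds a failed process by replaying the log. It is not the scheme proposed in the paper: the names `MessageLog`, `CounterProcess`, `run_with_logging`, and `recover` are hypothetical, the log is assumed to survive the failure, and the toy process uses commutative updates only to show one case in which replaying messages in a different order from the original receive order still restores the same state. The paper's actual recovery conditions are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Hypothetical log of messages delivered to one process (assumed to survive failures)."""
    entries: list = field(default_factory=list)

    def record(self, sender, payload):
        self.entries.append((sender, payload))


@dataclass
class CounterProcess:
    """Toy process whose state is a running total; additions commute."""
    total: int = 0

    def deliver(self, sender, payload):
        self.total += payload


def run_with_logging(process, log, messages):
    """Log every message before handing it to the process."""
    for sender, payload in messages:
        log.record(sender, payload)
        process.deliver(sender, payload)


def recover(log, order=None):
    """Rebuild a fresh process by replaying logged messages.

    `order` is an optional list of log indices; None means original receive order.
    """
    fresh = CounterProcess()
    entries = log.entries if order is None else [log.entries[i] for i in order]
    for sender, payload in entries:
        fresh.deliver(sender, payload)
    return fresh


log = MessageLog()
original = CounterProcess()
run_with_logging(original, log, [("p1", 5), ("p2", 7), ("p1", 3)])

# Replay in the original receive order.
same_order = recover(log)

# Replay with p2's message first; each sender's own order is preserved.
other_order = recover(log, order=[1, 0, 2])
```

In this toy case both replays reach the state the process had before the failure. A process with non-commutative updates would not behave this way in general, which is why precise conditions for proper recovery are needed.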

Author information

Authors and Affiliations

Department of Computer Science, University of Maryland, College Park, MD 20742, USA
Pankaj Jalote

Additional information

Pankaj Jalote received the Bachelor of Technology degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 1980, the M.S. degree in computer science from Pennsylvania State University, University Park, in 1982, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1985. From August 1985 to July 1989 he was an Assistant Professor in the Department of Computer Science at the University of Maryland, College Park. Currently he is an Assistant Professor in the Department of Computer Science and Engineering at IIT Kanpur, India. His research interests include fault-tolerant computing, distributed systems, and software engineering.

This work was supported in part by NSF grant DCI-8610337.

About this article

Cite this article

Jalote, P. Fault tolerant processes. Distrib Comput 3, 187–195 (1989). https://doi.org/10.1007/BF01784887

Issue Date: December 1989
DOI: https://doi.org/10.1007/BF01784887
