Replica determinism in distributed real-time systems: A brief survey

Poledna, Stefan

doi:10.1007/BF01088629

Published: May 1994

Volume 6, pages 289–316, (1994)

Real-Time Systems

Abstract

Replication of entities is a convenient technique for achieving fault tolerance. The problem of replica determinism is to ensure that replicated entities show consistent behavior in the absence of failures. Possible sources of replica non-determinism, as well as basic requirements and strategies for enforcing replica determinism, are presented. The problem of enforcing replica determinism under real-time constraints is surveyed in the context of the communication problem for distributed systems. Furthermore, the close interdependence between replica determinism on the one hand and synchronization strategies, failure handling, and redundancy preservation on the other is reviewed. The impact of synchronous and asynchronous approaches on replication strategies is also discussed.
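
As an illustration only (not taken from the survey), the following minimal sketch shows one way fault-free replicas can still disagree: each replica runs identical decision logic but samples a slightly different local reading near a threshold. The names `THRESHOLD`, `decide`, and the readings are hypothetical, and taking the median is merely a stand-in for whatever agreement protocol a real system would use to fix a common input.

```python
from statistics import median

THRESHOLD = 100.0  # hypothetical limit, chosen only for this example


def decide(reading: float) -> str:
    """Deterministic decision logic shared by every replica."""
    return "open_valve" if reading > THRESHOLD else "hold"


def decide_independently(readings):
    """Each replica acts on its own local reading, with no agreement step."""
    return [decide(r) for r in readings]


def decide_after_agreement(readings):
    """Replicas first settle on one input value, then all decide on it."""
    agreed_input = median(readings)  # stand-in for an agreement protocol
    return [decide(agreed_input) for _ in readings]


# Three fault-free replicas observe the same quantity with small deviations.
local_readings = [99.9, 100.1, 100.0]
independent = decide_independently(local_readings)
agreed = decide_after_agreement(local_readings)
```

Here `independent` contains differing decisions although no replica has failed, while `agreed` is identical across replicas because the input was fixed before the decision was taken.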

Author information

Authors and Affiliations

Institut für Technische Informatik, Technische Universität Wien, A-1040 Vienna, Austria
Stefan Poledna

About this article

Cite this article

Poledna, S. Replica determinism in distributed real-time systems: A brief survey. Real-Time Syst 6, 289–316 (1994). https://doi.org/10.1007/BF01088629

Issue Date: May 1994
DOI: https://doi.org/10.1007/BF01088629