Replica determinism in distributed real-time systems: A brief survey

Real-Time Systems

Abstract

Replication of entities is a convenient technique for achieving fault tolerance. The problem of replica determinism is to ensure that replicated entities show consistent behavior in the absence of failures. Possible sources of replica non-determinism, as well as basic requirements and strategies for enforcing replica determinism, are presented. The problem of enforcing replica determinism under real-time constraints is surveyed in the context of the communication problem for distributed systems. Furthermore, the close interdependence between replica determinism on the one hand and synchronization strategies, failure handling, and redundancy preservation on the other is reviewed. The impact of synchronous or asynchronous approaches on replication strategies is also discussed.

Cite this article

Poledna, S. Replica determinism in distributed real-time systems: A brief survey. Real-Time Syst 6, 289–316 (1994). https://doi.org/10.1007/BF01088629

  • DOI: https://doi.org/10.1007/BF01088629
