Distributed fault-tolerance
Supporting fault-tolerant distributed computations under real-time requirements

https://doi.org/10.1016/0140-3664(92)90108-Q

Abstract

In contrast to conventional (trans)action concepts, the proposed dynamic action model allows optimistic recovery to gain high efficiency during normal operation. To minimize time overhead, we use a redundant recovery graph to record the necessary recovery information. Based on this graph, we provide decentralized protocols that efficiently produce a consistent system state concurrently with normal system activity. For real-time applications in distributed systems, error processing time has to be minimized. To achieve this, the proposed concept is extended to the parallel dynamic action scheme, in which the different versions are executed in parallel. This leads to a recovery concept that combines efficient distributed processing during normal operation with prompt reaction in case of an error.
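
The abstract does not specify the structure of the recovery graph, so the following is only a minimal sketch under assumptions, not the authors' protocol. It assumes the graph records which actions consumed results of which other actions, and that a consistent state is restored by rolling back the failed action together with everything that transitively depends on it. The class name `RecoveryGraph` and its methods are hypothetical, and the sketch is centralized, whereas the paper describes decentralized protocols.

```python
from collections import defaultdict, deque


class RecoveryGraph:
    """Hypothetical dependency record (an assumption, not the paper's structure).

    An edge producer -> consumer means the consumer used results of the
    producer, so it must be rolled back if the producer is rolled back.
    """

    def __init__(self):
        self._dependents = defaultdict(set)

    def record_dependency(self, producer, consumer):
        # Called during normal operation, e.g. when a message is delivered.
        self._dependents[producer].add(consumer)

    def rollback_set(self, failed):
        # Breadth-first traversal collecting every action that directly or
        # transitively depends on the failed action.
        affected = {failed}
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            for dependent in self._dependents.get(current, ()):
                if dependent not in affected:
                    affected.add(dependent)
                    queue.append(dependent)
        return affected


graph = RecoveryGraph()
graph.record_dependency("A", "B")  # B consumed output of A
graph.record_dependency("B", "C")  # C consumed output of B
graph.record_dependency("D", "C")  # C also consumed output of D
to_undo = graph.rollback_set("B")  # B and C are undone; A and D survive
```

In this reading, optimistic execution pays only the cost of recording an edge during normal operation, and the traversal cost is incurred only when an error actually occurs.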
