Supporting fault-tolerant distributed computations under real-time requirements

doi:10.1016/0140-3664(92)90108-Q

Computer Communications

Volume 15, Issue 4, May 1992, Pages 252-260

https://doi.org/10.1016/0140-3664(92)90108-Q

Abstract

In contrast to conventional (trans)action concepts, the proposed dynamic action model allows optimistic recovery in order to achieve high efficiency during normal operation. To minimize time overhead, a redundant recovery graph records the necessary recovery information. Based on this graph, we provide decentralized protocols that efficiently produce a consistent system state concurrently with normal system activity. For real-time applications in distributed systems, error processing time must also be minimized. To achieve this, the proposed concept is extended to the parallel dynamic action scheme, in which the different versions are executed in parallel. The result is a recovery concept that combines efficient distributed processing during normal operation with prompt reaction in case of an error.
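The abstract does not specify the structure of the recovery graph or the decentralized protocols, so the following is only a minimal sketch of the general idea, under stated assumptions: actions are nodes, an edge records that one action used the results of another, and recovery undoes an erroneous action together with everything that transitively depends on it. The class `RecoveryGraph` and its methods are hypothetical names chosen for illustration, and the traversal is centralized for brevity, whereas the paper describes decentralized protocols.

```python
from collections import defaultdict, deque


class RecoveryGraph:
    """Illustrative dependency graph (an assumption, not the paper's definition).

    An edge source -> dependent means `dependent` used results of `source`.
    """

    def __init__(self):
        self._dependents = defaultdict(set)
        self._actions = set()

    def add_action(self, action):
        # Register an action that has no recorded dependencies yet.
        self._actions.add(action)

    def record_dependency(self, source, dependent):
        # Record that `dependent` consumed information produced by `source`.
        self._actions.update((source, dependent))
        self._dependents[source].add(dependent)

    def rollback_set(self, failed):
        # Breadth-first search: the failed action plus all transitive dependents.
        seen = {failed}
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            for nxt in self._dependents[current] - seen:
                seen.add(nxt)
                queue.append(nxt)
        return seen

    def consistent_state(self, failed):
        # Actions whose effects survive once the rollback set is undone.
        return self._actions - self.rollback_set(failed)


graph = RecoveryGraph()
graph.record_dependency("A", "B")  # B used results of A
graph.record_dependency("B", "C")  # C used results of B
graph.add_action("D")              # D is independent

print(sorted(graph.rollback_set("B")))      # ['B', 'C']
print(sorted(graph.consistent_state("B")))  # ['A', 'D']
```

In this sketch, optimistic execution corresponds to letting `B` and `C` proceed without waiting for `A` or `B` to be validated; the cost of that optimism is paid only when an error is detected and the dependents are rolled back.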

Cited by (2)

The PSTR/SNS scheme for real-time fault tolerance via active object replication and network surveillance
2000, IEEE Transactions on Knowledge and Data Engineering
Fault Tolerance in Highly Parallel Hardware Systems
1994, IEEE Micro
