Reliability and performance analysis of hardware–software systems with fault-tolerant software components

doi:10.1016/j.ress.2005.04.004

Reliability Engineering & System Safety

Volume 91, Issue 5, May 2006, Pages 570-579

Abstract

This paper presents an algorithm for evaluating reliability and expected execution time for systems consisting of fault-tolerant software components running on several hardware units. The components are built from functionally equivalent but independently developed versions characterized by different reliabilities and execution times. A different number of versions can be executed simultaneously depending on the number of available units. The system reliability is defined as the probability that the system produces a correct output in a specified time.

Introduction

Software failures are caused by errors made in various phases of program development. When software reliability is of critical importance, special programming techniques are used to achieve fault tolerance. Two of the best-known fault-tolerant software design methods are N-version programming (NVP) and the recovery block scheme (RBS) [1]. Both methods are based on the redundancy of software modules (functionally equivalent but independently developed) and the assumption that coincident failures of modules are rare. Fault tolerance usually requires additional resources and results in performance penalties (particularly with regard to computation time), which constitutes a tradeoff between software performance and reliability.

NVP was proposed by Chen and Avizienis [2]. This approach presumes the execution of N functionally equivalent software modules (called versions) that receive the same input and send their outputs to a voter, which is aimed at determining the system output. The voter produces an output if at least M out of N outputs agree (it is presumed that the probability that M wrong outputs agree is negligibly small). Otherwise, the system fails. Usually majority voting is used in which N is odd and M=(N+1)/2.
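The M-out-of-N voting rule described above can be sketched as follows. This is a minimal illustration, not the voter of [2]: the function names `vote` and `majority_threshold` are hypothetical, and outputs are assumed to be directly comparable (hashable) values.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Hashable], m: int) -> Optional[Hashable]:
    """Return the output shared by at least m versions, or None if the system fails."""
    if not outputs:
        return None
    # Most frequent output and the number of versions that produced it.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= m else None


def majority_threshold(n: int) -> int:
    """Majority voting threshold M = (N + 1) / 2, defined here for odd N only."""
    if n % 2 == 0:
        raise ValueError("majority voting as described assumes odd N")
    return (n + 1) // 2
```

For example, `vote([7, 7, 3], majority_threshold(3))` returns `7`, while `vote([1, 2, 3], 2)` returns `None` because no two outputs agree.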

In some applications, the available computational resources do not allow all of the versions to be executed simultaneously. In these cases, the versions are executed according to some predefined sequence and the program execution terminates either when M versions produce the same output (success) or when after the execution of all the N versions the number of equivalent outputs is less than M (failure). The entire program execution time is a random variable depending on the parameters of the versions and on the number of versions that can be executed simultaneously.
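To make the dependence of execution time on the number of simultaneously executable versions concrete, the sketch below computes the termination time for one fixed realization of version outcomes. It rests on assumptions not stated in this excerpt: versions start in the predefined order, each as soon as one of `slots` parallel slots becomes free, and all correct outputs agree with each other. The name `termination_time` and its parameters are hypothetical.

```python
import heapq
from typing import Optional, Sequence


def termination_time(exec_times: Sequence[float],
                     correct: Sequence[bool],
                     m: int,
                     slots: int) -> Optional[float]:
    """Time at which the m-th correct output appears, or None on failure.

    Assumed scheduling rule: versions start in sequence order, each as soon
    as one of `slots` parallel execution slots is free.
    """
    if slots < 1:
        return None  # no slot available, so no version can run
    free_at = [0.0] * slots  # min-heap of times at which each slot becomes free
    heapq.heapify(free_at)
    correct_finishes = []
    for tau, ok in zip(exec_times, correct):
        start = heapq.heappop(free_at)  # earliest free slot takes the next version
        end = start + tau
        heapq.heappush(free_at, end)
        if ok:
            correct_finishes.append(end)
    correct_finishes.sort()
    # Success once m correct outputs exist; otherwise all N versions ran without success.
    return correct_finishes[m - 1] if len(correct_finishes) >= m else None
```

With execution times `[4, 2, 3]`, all versions correct and `m=2`, the function gives 6 for one slot, 4 for two slots and 3 for three slots, which shows how additional hardware shortens the time to a successful vote.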

RBS was proposed by Randell [3]. In this approach, after the execution of each version, its output is tested by an acceptance test block (ATB). If the ATB accepts the version output, the process is terminated and the version output becomes the output of the entire system. If none of the N versions produces an accepted output, the system fails. If the computational resources allow simultaneous execution of several versions, the versions are executed according to some predefined sequence and the entire program terminates either when one of the versions produces an output accepted by the ATB (success) or when, after the execution of all N versions, no output is accepted by the ATB (failure). If the acceptance test time is included in the execution time of each version, the RBS performance model becomes identical to the performance model of NVP with M=1.
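The sequential case of the recovery block scheme (one version at a time) can be sketched as below. This is only an illustration under stated assumptions: `recovery_block` and `acceptance_test` are hypothetical names, versions are modeled as zero-argument callables, and a version that raises an exception is treated as if its output had been rejected, which is a choice made here rather than something the text prescribes.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def recovery_block(versions: Iterable[Callable[[], T]],
                   acceptance_test: Callable[[T], bool]) -> Optional[T]:
    """Run versions in order; return the first output the acceptance test accepts."""
    for run_version in versions:
        try:
            output = run_version()
        except Exception:
            continue  # assumption: a crashed version counts as a rejected output
        if acceptance_test(output):
            return output  # success: this output becomes the system output
    return None  # failure: no version produced an accepted output
```

For example, with versions returning -1, 4 and 9 and an acceptance test `lambda x: x >= 0`, the call returns 4 and the third version is never executed.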

Estimating the effect of the fault-tolerant programming on system performance is especially important in safety critical real-time computer applications. This effect has been studied by Tai et al. in [4] and by Goseva-Popstojanova and Grnarov in [5], [6]. While in [4] a basic realization of NVP (N=3, M=2) consisting of versions with identical fault probabilities and different execution times has been considered, in [5], [6] NVP with arbitrary N has been studied in which both times to failure and execution times of different versions are identically distributed random variables.

In many cases, the information about version reliability and execution time is available from separate testing and/or reliability prediction models [7]. This information can be incorporated into a fault-tolerant program model in order to obtain an evaluation of its reliability and performance. The reliability model of NVP with versions having different reliability has been considered in [8]. However, in this study, the system performance evaluation problem has not been addressed and a general algorithm for evaluating NVP reliability for arbitrary N and M has not been suggested.

Since the performance of fault-tolerant programs depends on the availability of computational resources, the impact of hardware availability should be taken into account when the system availability is evaluated. In [9], several simple configurations of hardware–software systems with N≤3 and no more than three hardware units have been studied, also without considering system performance.

This paper presents an algorithm for finding the reliability and performance measures for arbitrary fault-tolerant hardware–software systems with a given hardware structure, given availability of hardware units and given parameters of software versions. The novelty of the presented algorithm lies in its ability to take into account both hardware and software reliability for an arbitrary number of software versions and hardware units and to evaluate both system reliability and performance measures.

The algorithm does not take common cause failures into account, which leads to overestimation of system reliability. However, even such optimistic estimates can be used for comparison of different system architectures and for optimization of software system structure, as has been done in [8], [10], [11], [12].

The probabilities of common cause failures can be evaluated separately (elicited from experimental study) and added to system unreliability in order to obtain more accurate reliability estimates.
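Under the independence assumption discussed above (no common cause failures, and wrong outputs never agreeing), the probability that at least M of N versions with different reliabilities produce the correct output can be computed with a small dynamic program. The sketch below is an illustration only, not the paper's algorithm: it ignores hardware availability and execution time, and the name `at_least_m_correct` is hypothetical.

```python
from typing import Sequence


def at_least_m_correct(reliabilities: Sequence[float], m: int) -> float:
    """Probability that at least m versions are correct, assuming independent failures."""
    # dist[k] = probability that exactly k of the versions processed so far are correct
    dist = [1.0]
    for r in reliabilities:
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1.0 - r)   # this version fails
            nxt[k + 1] += p * r       # this version succeeds
        dist = nxt
    return sum(dist[m:])
```

For three versions with reliability 0.9 each and M=2, this gives 3·0.9²·0.1 + 0.9³ = 0.972. An estimate of the common cause failure probability, obtained separately, would then be added to the resulting unreliability as described above.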

Section snippets

Model

According to the model presented in [8] the software system consists of C components. Each component performs a subtask and the sequential execution of the components performs a major task. Such series architecture can be found in many applications where the output of a component is fed to the next component as its input. An example of such architecture is a speech recognition system presented in [9]. Performance of different applications and services in the grid systems can also be represented

Number of versions that can be executed simultaneously

The number of available hardware units in component c can vary from 0 to H_c. Given all of the units are identical and have availability A_c, one can easily obtain probabilities Pr{h_c=x} for 1≤x≤H_c: $Q_{c}(x) = \Pr\{h_{c} = x\} = \binom{H_{c}}{x} A_{c}^{x} (1 - A_{c})^{H_{c} - x}.$

The number of available hardware units x determines the number of versions that can be executed simultaneously: l_c(x). Therefore $\Pr\{L_{c} = l_{c}(x)\} = Q_{c}(x).$

The pairs Q_c(x), l_c(x) for 1≤x≤H_c determine the pmf of the discrete r.v. L_c.
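The binomial expression for Q_c(x) and the resulting pmf of L_c can be evaluated with a few lines of code. The sketch below is illustrative: the names `units_pmf` and `simultaneous_versions_pmf` are hypothetical, the case x=0 (no unit available) is included for completeness, and probabilities are summed when several values of x map to the same l_c(x), which is an assumption about how a non-injective l_c would be handled.

```python
from math import comb
from typing import Callable, Dict


def units_pmf(H: int, A: float) -> Dict[int, float]:
    """Q(x) = C(H, x) * A**x * (1 - A)**(H - x) for x = 0..H identical units."""
    return {x: comb(H, x) * A**x * (1 - A)**(H - x) for x in range(H + 1)}


def simultaneous_versions_pmf(H: int, A: float,
                              l: Callable[[int], int] = lambda x: x) -> Dict[int, float]:
    """pmf of L, the number of versions that can run simultaneously.

    `l` maps the number of available units x to l(x); the default l(x) = x
    means one version per hardware unit.
    """
    pmf: Dict[int, float] = {}
    for x, q in units_pmf(H, A).items():
        pmf[l(x)] = pmf.get(l(x), 0.0) + q  # merge units counts giving the same l(x)
    return pmf
```

For instance, `units_pmf(2, 0.9)` yields probabilities of approximately 0.01, 0.18 and 0.81 for 0, 1 and 2 available units.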

Version termination times

In each component c, a sequence in which the

Analytical example

Consider a system consisting of two components. The first component consists of H₁=2 hardware units with availability A₁=0.9 on which N₁=5 software versions with M₁=3 are executed. The second component consists of H₂=3 hardware units with availability A₂=0.8 on which N₂=3 software versions with M₂=2 are executed. The parameters of versions r_ci and τ_ci are presented in Table 1.

One software version can be executed on each hardware unit: l_c(h_c)=h_c.
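As a check on the hardware part of this example, the binomial expression for Q_c(x) given earlier can be evaluated by hand. The values below follow directly from H₁=2, A₁=0.9 and H₂=3, A₂=0.8; the case x=0 (no unit available) is included here for completeness. Since l_c(h_c)=h_c, these are also the probabilities of L_c taking the corresponding values.

```latex
\[
\begin{aligned}
Q_1(0) &= \binom{2}{0}\,0.9^{0}\,0.1^{2} = 0.01, &
Q_1(1) &= \binom{2}{1}\,0.9^{1}\,0.1^{1} = 0.18, &
Q_1(2) &= \binom{2}{2}\,0.9^{2}\,0.1^{0} = 0.81, \\
Q_2(0) &= \binom{3}{0}\,0.8^{0}\,0.2^{3} = 0.008, &
Q_2(1) &= \binom{3}{1}\,0.8^{1}\,0.2^{2} = 0.096, &
Q_2(2) &= \binom{3}{2}\,0.8^{2}\,0.2^{1} = 0.384, \\
Q_2(3) &= \binom{3}{3}\,0.8^{3}\,0.2^{0} = 0.512.
\end{aligned}
\]
```

Each set of probabilities sums to 1, as required for a pmf.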

The termination times obtained for different possible

Summary and further work

The presented model covers fault-tolerant systems with series architecture and an arbitrary number of hardware units and software versions. The presented algorithm is aimed at evaluating system reliability and performance indices that can be used for comparison of different system configurations and for solving system structure optimization problems.

The model and the algorithm in their present form have the following limitations:

1. The common cause failures are not taken into account. This leads

References (19)

X. Teng et al., Software fault tolerance
L. Chen et al., N-version programming: a fault tolerance approach to the reliable software (1978)
B. Randell, System structure for software fault tolerance, IEEE Trans Software Eng (1975)
A. Tai et al., Performability enhancement of fault-tolerant software, IEEE Trans Reliab (1993)
K. Goseva-Popstojanova et al., Performability modeling of N version programming technique (1995)
K. Goseva-Popstojanova et al., Performability and reliability modeling of N version fault tolerant software in real-time systems (1997)
F. Belli et al., Fault-tolerant programs and their reliability, IEEE Trans Reliab (1990)
N. Ashrafi et al., Optimal design of large software-systems using N-version programming, IEEE Trans Reliab (1994)
N. Wattanapongsakorn et al., Reliability optimization models for embedded systems with multiple applications, IEEE Trans Reliab (2004)

There are more references available in the full text version of this article.