Reliability and performance analysis of hardware–software systems with fault-tolerant software components

https://doi.org/10.1016/j.ress.2005.04.004

Abstract

This paper presents an algorithm for evaluating the reliability and expected execution time of systems consisting of fault-tolerant software components running on several hardware units. The components are built from functionally equivalent but independently developed versions characterized by different reliability and execution time. A different number of versions can be executed simultaneously depending on the number of available units. The system reliability is defined as the probability that the system produces a correct output in a specified time.

Introduction

Software failures are caused by errors made in various phases of program development. When the software reliability is of critical importance, special programming techniques are used in order to achieve its fault tolerance. Two of the best-known fault-tolerant software design methods are N-version programming (NVP) and recovery block scheme (RBS) [1]. Both methods are based on the redundancy of software modules (functionally equivalent but independently developed) and the assumption that coincident failures of modules are rare. The fault tolerance usually requires additional resources and results in performance penalties (particularly with regard to computation time), which constitutes a tradeoff between software performance and reliability.

NVP was proposed by Chen and Avizienis [2]. This approach presumes the execution of N functionally equivalent software modules (called versions) that receive the same input and send their outputs to a voter, which is aimed at determining the system output. The voter produces an output if at least M out of N outputs agree (it is presumed that the probability that M wrong outputs agree is negligibly small). Otherwise, the system fails. Usually majority voting is used in which N is odd and M=(N+1)/2.
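The M-out-of-N voting rule described above can be sketched in a few lines. This is a minimal illustration, not the voter of any particular system: the function names `vote` and `majority_threshold` are hypothetical, and outputs are assumed to be hashable values that can be compared for exact equality.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Hashable], m: int) -> Optional[Hashable]:
    """Return the output shared by at least m versions, or None (system failure).

    Assumption: agreement means exact equality of outputs. With a majority
    threshold at most one value can reach m, so the result is unambiguous.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= m else None


def majority_threshold(n: int) -> int:
    """Majority-voting threshold M = (N + 1) / 2, defined here for odd N only."""
    if n % 2 == 0:
        raise ValueError("majority voting as described assumes odd N")
    return (n + 1) // 2
```

For example, with N=3 the threshold is M=2, so three outputs of which two agree yield the agreed value, while three mutually different outputs yield a failure.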

In some applications, the available computational resources do not allow all of the versions to be executed simultaneously. In these cases, the versions are executed according to some predefined sequence and the program execution terminates either when M versions produce the same output (success) or when after the execution of all the N versions the number of equivalent outputs is less than M (failure). The entire program execution time is a random variable depending on the parameters of the versions and on the number of versions that can be executed simultaneously.
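To make the dependence of the execution time on the number of simultaneously executable versions concrete, the following sketch computes when each version terminates and when the whole program stops. It rests on several assumptions that are not taken from the paper: versions are started in their predefined order on whichever unit becomes free first, execution times are fixed numbers, all correct outputs agree, and wrong outputs never agree with each other. The names `termination_times` and `program_time` are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def termination_times(taus: Sequence[float], l: int) -> List[float]:
    """Termination time of each version when at most l versions run at once.

    Assumption: the next version in the sequence starts as soon as any of
    the l execution slots becomes free.
    """
    if l < 1:
        raise ValueError("at least one version must be executable at a time")
    free_at = [0.0] * l          # time at which each slot becomes free (a valid heap)
    ends = []
    for tau in taus:
        start = heapq.heappop(free_at)   # earliest free slot
        end = start + tau
        heapq.heappush(free_at, end)
        ends.append(end)
    return ends


def program_time(taus: Sequence[float], correct: Sequence[bool],
                 l: int, m: int) -> Tuple[float, bool]:
    """Return (stopping time, success flag) for an M-out-of-N execution.

    Success: the m-th correct output arrives. Failure: all versions have
    finished and fewer than m outputs are correct.
    """
    ends = termination_times(taus, l)
    good = sorted(e for e, ok in zip(ends, correct) if ok)
    if len(good) >= m:
        return good[m - 1], True
    return max(ends, default=0.0), False
```

With execution times 2, 3, 1, 4, 2 and two versions running at a time, the versions terminate at times 2, 3, 3, 7 and 5; if only the second version is wrong and M=3, the program stops successfully at time 5.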

RBS was proposed by Randell [3]. In this approach, after the execution of each version, its output is tested by an acceptance test block (ATB). If the ATB accepts the version output, the process is terminated and the version output becomes the output of the entire system. If none of the N versions produces an accepted output, the system fails. If the computational resources allow simultaneous execution of several versions, the versions are executed according to some predefined sequence and the entire program terminates either when one of the versions produces an output accepted by the ATB (success) or when, after the execution of all N versions, no output has been accepted by the ATB (failure). If the acceptance test time is included in the execution time of each version, the RBS performance model becomes identical to the performance model of the NVP with M=1.
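The recovery block control flow can be sketched as follows for the simplest case of one version executing at a time. This is an illustrative sketch only: the function name `recovery_block`, the signature of the acceptance test, and the decision to treat an exception raised by a version as a rejected output are all assumptions made here.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def recovery_block(versions: Sequence[Callable[[Any], Any]],
                   acceptance_test: Callable[[Any, Any], bool],
                   x: Any) -> Tuple[bool, Optional[Any]]:
    """Run versions in their predefined order until one output is accepted.

    Returns (True, output) on success and (False, None) if no version
    produces an accepted output.
    """
    for version in versions:
        try:
            y = version(x)
        except Exception:
            continue                  # assumption: a crash counts as a rejected output
        if acceptance_test(x, y):     # the acceptance test block (ATB)
            return True, y
    return False, None
```

Because the first accepted output ends the process, this behaves like the M-out-of-N scheme with M=1, provided the time spent in the acceptance test is counted as part of each version's execution time.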

Estimating the effect of the fault-tolerant programming on system performance is especially important in safety critical real-time computer applications. This effect has been studied by Tai et al. in [4] and by Goseva-Popstojanova and Grnarov in [5], [6]. While in [4] a basic realization of NVP (N=3, M=2) consisting of versions with identical fault probabilities and different execution times has been considered, in [5], [6] NVP with arbitrary N has been studied in which both times to failure and execution times of different versions are identically distributed random variables.

In many cases, the information about version reliability and execution time is available from separate testing and/or reliability prediction models [7]. This information can be incorporated into a fault-tolerant program model in order to obtain an evaluation of its reliability and performance. The reliability model of NVP with versions having different reliability has been considered in [8]. However, in this study, the system performance evaluation problem has not been addressed and a general algorithm for evaluating NVP reliability for arbitrary N and M has not been suggested.
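As a point of reference for what evaluating NVP reliability with unequal version reliabilities involves, the sketch below computes the probability that at least M of N versions produce a correct output. It is a standard dynamic-programming calculation, not the algorithm of this paper or of [8], and it assumes that versions fail independently and that the voter succeeds exactly when at least M outputs are correct. The function name `at_least_m_correct` is hypothetical.

```python
from typing import List, Sequence


def at_least_m_correct(reliabilities: Sequence[float], m: int) -> float:
    """P(at least m versions are correct), assuming independent versions.

    dist[k] holds the probability that exactly k of the versions
    processed so far are correct.
    """
    dist: List[float] = [1.0]
    for r in reliabilities:
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1.0 - r)   # this version fails
            nxt[k + 1] += p * r       # this version succeeds
        dist = nxt
    return sum(dist[m:])
```

For three versions with reliabilities 0.9, 0.8 and 0.7 and M=2, the result is 0.504 + 0.216 + 0.126 + 0.056 = 0.902.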

Since the performance of fault-tolerant programs depends on the availability of computational resources, the impact of hardware availability should be taken into account when the system availability is evaluated. In [9], several simple configurations of hardware–software systems with N≤3 and no more than three hardware units have been studied, also without considering system performance.

This paper presents an algorithm for finding the reliability and performance measures for arbitrary fault-tolerant hardware–software systems with given hardware structure and availability of hardware units and given parameters of software versions. The novelty of the presented algorithm lies in its ability to take into account both hardware and software reliability for arbitrary number of software versions and hardware units and to evaluate both system reliability and performance measures.

The algorithm does not take into account common cause failures, which leads to overestimation of system reliability. However, even such optimistic estimates can be used for comparison of different system architectures and for optimization of software system structure, as has been done in [8], [10], [11], [12].

The probabilities of common cause failures can be evaluated separately (elicited from experimental study) and added to system unreliability in order to obtain more accurate reliability estimates.


Model

According to the model presented in [8] the software system consists of C components. Each component performs a subtask and the sequential execution of the components performs a major task. Such series architecture can be found in many applications where the output of a component is fed to the next component as its input. An example of such architecture is a speech recognition system presented in [9]. Performance of different applications and services in the grid systems can also be represented

Number of versions that can be executed simultaneously

The number of available hardware units in component c can vary from 0 to $H_c$. Given all of the units are identical and have availability $A_c$, one can easily obtain probabilities $\Pr\{h_c = x\}$ for $1 \le x \le H_c$:

$$Q_c(x) = \Pr\{h_c = x\} = \binom{H_c}{x} A_c^{x} (1 - A_c)^{H_c - x}.$$

The number of available hardware units x determines the number of versions that can be executed simultaneously: $l_c(x)$. Therefore

$$\Pr\{L_c = l_c(x)\} = Q_c(x).$$

The pairs $Q_c(x)$, $l_c(x)$ for $1 \le x \le H_c$ determine the pmf of the discrete r.v. $L_c$.
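The binomial probabilities and the resulting distribution of the number of simultaneously executable versions can be computed directly. The sketch below is illustrative: the function names `hardware_pmf` and `parallelism_pmf` are hypothetical, the mapping from available units to executable versions is passed in as a plain function, and the case x=0 is included so that the probabilities sum to one.

```python
from math import comb
from typing import Callable, Dict


def hardware_pmf(H: int, A: float) -> Dict[int, float]:
    """Q(x) = C(H, x) * A**x * (1 - A)**(H - x) for x = 0..H."""
    return {x: comb(H, x) * A**x * (1.0 - A) ** (H - x) for x in range(H + 1)}


def parallelism_pmf(H: int, A: float,
                    l: Callable[[int], int] = lambda x: x) -> Dict[int, float]:
    """pmf of the number of versions that can run simultaneously.

    Different x may map to the same l(x), so their probabilities are summed.
    The default l(x) = x means one version per available hardware unit.
    """
    pmf: Dict[int, float] = {}
    for x, q in hardware_pmf(H, A).items():
        pmf[l(x)] = pmf.get(l(x), 0.0) + q
    return pmf
```

Passing a different mapping, for example `lambda x: min(x, 2)`, models a case where no more than two versions can run at once regardless of how many units are available.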

Version termination times

In each component c, a sequence in which the

Analytical example

Consider a system consisting of two components. The first component consists of H1=2 hardware units with availability A1=0.9 on which N1=5 software versions with M1=3 are executed. The second component consists of H2=3 hardware units with availability A2=0.8 on which N2=3 software versions with M2=2 are executed. The parameters of versions rci and τci are presented in Table 1.

One software version can be executed on each hardware unit: lc(hc)=hc.
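For this example, the binomial availability model for the hardware units gives the following probabilities for the number of simultaneously executable versions. This is plain arithmetic from the stated values of H and A; the x=0 terms are included here for completeness, and since one version runs on each unit, the number of executable versions equals x.

```latex
\begin{align*}
Q_1(0) &= (0.1)^2 = 0.01, &
Q_1(1) &= 2 \cdot 0.9 \cdot 0.1 = 0.18, &
Q_1(2) &= (0.9)^2 = 0.81, \\
Q_2(0) &= (0.2)^3 = 0.008, &
Q_2(1) &= 3 \cdot 0.8 \cdot (0.2)^2 = 0.096, \\
Q_2(2) &= 3 \cdot (0.8)^2 \cdot 0.2 = 0.384, &
Q_2(3) &= (0.8)^3 = 0.512.
\end{align*}
```

Each set of probabilities sums to one: 0.01 + 0.18 + 0.81 = 1 and 0.008 + 0.096 + 0.384 + 0.512 = 1.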

The termination times obtained for different possible

Summary and further work

The presented model covers fault-tolerant systems with series architecture and an arbitrary number of hardware units and software versions. The presented algorithm is aimed at evaluating system reliability and performance indices that can be used for comparison of different system configurations and for solving system structure optimization problems.

The model and the algorithm in their present form have the following limitations:

  1. The common cause failures are not taken into account. This leads

References (19)

  • X. Teng et al., Software fault tolerance
  • L. Chen et al., N-version programming: a fault tolerance approach to the reliable software (1978)
  • B. Randell, System structure for software fault tolerance, IEEE Trans Software Eng (1975)
  • A. Tai et al., Performability enhancement of fault-tolerant software, IEEE Trans Reliab (1993)
  • K. Goseva-Popstojanova et al., Performability modeling of N version programming technique (1995)
  • K. Goseva-Popstojanova et al., Performability and reliability modeling of N version fault tolerant software in real-time systems (1997)
  • F. Belli et al., Fault-tolerant programs and their reliability, IEEE Trans Reliab (1990)
  • N. Ashrafi et al., Optimal design of large software-systems using N-version programming, IEEE Trans Reliab (1994)
  • N. Wattanapongsakorn et al., Reliability optimization models for embedded systems with multiple applications, IEEE Trans Reliab (2004)
There are more references available in the full text version of this article.
