Abstract
The widespread reliance on software for mission- and life-critical applications makes the reliability of these systems essential. Techniques such as fault tolerance have been proposed to achieve the highest levels of software reliability. However, the fault tolerance paradigm is vulnerable to correlated failures, where a majority of the software versions fail on the same input, leading to system failure. This paper derives a trivariate Bernoulli distribution to quantify the negative impact of correlated failures on the reliability of fault-tolerant software composed of highly reliable versions. An experiment based on early empirical research demonstrates that the distribution supports reliability assessment for many combinations of version reliabilities and correlations. The results indicate that correlated failures detract from system reliability, but that this reliability is often higher than that of a system composed of the single most reliable version.
Abbreviations
- TVB: Trivariate Bernoulli Distribution
- NVP: \(N\)-version programming
- \(Y\): Sum of uncorrelated Poisson random variables
- \(X(\lambda )\): Poisson random variable with rate \(\lambda \)
- \(Z_{i}(p_{i})\): Bernoulli random variable with success probability \(p_{i}\,(1\le i\le 3)\)
- \({\mathbf{p}}\): Vector of multivariate Bernoulli success probabilities
- \(I_{\{0\}}(\cdot )\): Indicator function of zero
- \(\rho _{i,j}\): Correlation between Bernoulli variables \(Z_{i}\) and \(Z_j\)
- \(\rho _{i,j}^{+}\): Upper bound on correlation between \(Z_{i}\) and \(Z_j\)
- \(\varvec{\Sigma }\): Correlation matrix consisting of entries \(\rho _{i,j}\)
- \(\alpha _{i,j}\): Poisson rate encoding correlation between \(Z_{i}\) and \(Z_j\)
- \(\varvec{\alpha }(k)\): Matrix of Poisson encoding in iteration \(k\)
- \(\beta _{i,j}\): Notational simplification \((\beta _{i,j}=\exp (\alpha _{i,j}))\)
- \(\beta _{i,j}^{(1)}\): Entry \(\beta _{i,j}\) of matrix encoding in first iteration
- \(\varGamma \): Set of systems \(\varGamma =\{\mathtt{series}, \mathtt{parallel}, \mathtt{two}\, \mathtt{out}\, \mathtt{of}\, \mathtt{three}\}\)
- \(R_{\gamma }\): \(s\)-Expected reliability of system \(\gamma \in \varGamma \)
References
Alger, L., & Lala, J. (1986). A real time operating system for a nuclear power plant computer. In Proceedings of IEEE Real-Time Systems Symposium (pp. 244–248). New Orleans, LA.
Avizienis, A. (1985). The \(n\)-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, 11(12), 1491–1501.
Avizienis, A., Lyu, M., & Schutz, W. (1988). In search of effective diversity: A six-language study of fault-tolerant flight control software. In Proceedings of international symposium on fault tolerant computing (FTC 88) (pp. 15–22).
Eckhardt, D., & Lee, L. (1985). A theoretical basis for the analysis of multiversion software subject to coincident errors. IEEE Transactions on Software Engineering, 11(12), 1511–1517.
Eckhardt, D., Caglayan, A., Knight, J., Lee, L., McAllister, D., Vouk, M., et al. (1991). An experimental evaluation of software redundancy as a strategy for improving reliability. IEEE Transactions on Software Engineering, 17(7), 692–702.
Feller, W. (1968). An introduction to probability theory and its applications (3rd ed.). New York, NY: Wiley.
Fiondella, L. (2010). Reliability and sensitivity analysis of coherent systems with negatively correlated component failures. International Journal of Reliability, Quality and Safety Engineering, 17(5), 505–529.
Fisher, R. (1924). On a distribution yielding the error functions of several well known statistics. In Proceedings of the international congress of mathematics, Toronto (Vol. 2, pp. 805–813).
Johnson, N., Kotz, S., & Balakrishnan, N. (1997). Discrete multivariate distributions. Series in probability and statistics. New York, NY: Wiley.
Knight, J., & Leveson, N. (1986). An experimental evaluation of the assumption of independence in multi-version programming. IEEE Transactions on Software Engineering, 12(1), 96–109.
Knuth, D. (1997). Seminumerical algorithms (3rd ed., Vol. 2). Reading, MA: Addison Wesley.
Lai, C., & Xie, M. (2006). Stochastic ageing and dependence for reliability. New York, NY: Springer.
Littlewood, B. (1996). The impact of diversity upon common mode failures. Reliability Engineering and System Safety, 51(1), 101–113.
Littlewood, B., & Miller, D. (1989). Conceptual modeling of coincident failures in multiversion software. IEEE Transactions on Software Engineering, 15(12), 1596–1614.
Littlewood, B., Popov, P., & Strigini, L. (2001). Modeling software design diversity: A review. ACM Computing Surveys, 33(2), 177–208.
Lyu, M., & He, Y. (1993). Improving the n-version programming process through the evolution of a design paradigm. IEEE Transactions on Reliability, 42, 179–189.
Musa, J. (1994). Sensitivity of field failure intensity to operational profile errors. In Proceedings of international symposium on software reliability engineering (ISSRE 94) (pp. 334–337).
Musa, J., Fuoco, G., Irving, N., Kropfl, D., & Juhlin, B. (1996). The operational profile. In Handbook of software reliability engineering (pp. 167–216). New York, NY: McGraw-Hill.
Park, C., Park, T., & Shin, D. (1996). A simple method for generating correlated binary variates. The American Statistician, 50(4), 306–310.
Prentice, R. (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association, 81(394), 321–327.
Scott, R., Gault, J., & McAllister, D. (1987). Fault-tolerant software reliability modeling. IEEE Transactions on Software Engineering, 13(5), 582–592.
Shapiro, A. (2005). An ultra reliability project for NASA. In Proceedings of IEEE aerospace conference, Big Sky, MT (pp. 1–12).
Singpurwalla, N. (2006). Reliability and risk: A Bayesian perspective. Series in probability and statistics. New York, NY: Wiley.
Tang, D., & Iyer, R. (1992). Analysis and modeling of correlated failures in multicomputer systems. IEEE Transactions on Computers, 41(5), 567–577.
Teng, X., & Pham, H. (2002). A software-reliability growth model for n-version programming systems. IEEE Transactions on Reliability, 51(3), 311–321.
Appendix
Proof of Theorem 1
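Without loss of generality, let \( p_{1}> p_{2}> p_{3}\). It follows directly from Eqs. (16) and (21) that \(\alpha _{1,1}<\alpha _{2,2}<\alpha _{3,3}\) and that
\[\alpha_{1,2}<\alpha_{1,1},\qquad \alpha_{1,3}<\alpha_{1,1},\qquad \alpha_{2,3}<\alpha_{2,2}. \tag{22}\]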
These inequalities reduce the number of cases to the eight given in Table 1. The first six correspond to permutations of \(\alpha _{1,2}, \alpha _{1,3}\), and \(\alpha _{2,3}\) followed by \(\alpha _{1,1}<\alpha _{2,2}<\alpha _{3,3}\). In the other two cases, \(\alpha _{1,1}<\alpha _{2,3}\) and the beginning of the sequence is \(\alpha _{1,2}<\alpha _{1,3}\) or \(\alpha _{1,3}<\alpha _{1,2}\). The proof of Case I, where
\[\alpha_{1,2}<\alpha_{1,3}<\alpha_{2,3}<\alpha_{1,1}<\alpha_{2,2}<\alpha_{3,3}, \tag{23}\]
proceeds as follows.
Proof
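Using the algorithm of Park et al. (1996), write the initial \(\alpha _{i,j}\) in matrix form as
\[\varvec{\alpha}(0)=\begin{pmatrix}\alpha_{1,1} & \alpha_{1,2} & \alpha_{1,3}\\ \alpha_{1,2} & \alpha_{2,2} & \alpha_{2,3}\\ \alpha_{1,3} & \alpha_{2,3} & \alpha_{3,3}\end{pmatrix}. \tag{24}\]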
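We know from Eq. (23) that \(\alpha _{1,2}\) is the minimum element, so we subtract it from each entry of Eq. (24), producing
\[\varvec{\alpha}(1)=\begin{pmatrix}\alpha_{1,1}-\alpha_{1,2} & 0 & \alpha_{1,3}-\alpha_{1,2}\\ 0 & \alpha_{2,2}-\alpha_{1,2} & \alpha_{2,3}-\alpha_{1,2}\\ \alpha_{1,3}-\alpha_{1,2} & \alpha_{2,3}-\alpha_{1,2} & \alpha_{3,3}-\alpha_{1,2}\end{pmatrix}. \tag{25}\]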
Subtracting \(\alpha _{1,2}\) preserves the order of the nonzero entries in Eq. (25), so that \((\alpha _{1,3}-\alpha _{1,2})<(\alpha _{2,3} -\alpha _{1,2})<(\alpha _{1,1}-\alpha _{1,2}) <(\alpha _{2,2}-\alpha _{1,2})<(\alpha _{3,3}-\alpha _{1,2})\), which implies that the entry \((\alpha _{1,3}-\alpha _{1,2})\) in position \(\alpha _{1,3}(1)\) is the smallest value remaining. Thus, we subtract \((\alpha _{1,3}-\alpha _{1,2})\) from \(\alpha _{1,1}(1), \alpha _{1,3}(1)\), and \(\alpha _{3,3}(1)\), producing
\[\varvec{\alpha}(2)=\begin{pmatrix}\alpha_{1,1}-\alpha_{1,3} & 0 & 0\\ 0 & \alpha_{2,2}-\alpha_{1,2} & \alpha_{2,3}-\alpha_{1,2}\\ 0 & \alpha_{2,3}-\alpha_{1,2} & \alpha_{3,3}-\alpha_{1,3}\end{pmatrix}. \tag{26}\]
The algorithm of Park et al. (1996) requires that all off-diagonal elements \(\alpha _{i,j}, i< j\), be eliminated before the diagonal elements \(\alpha _{i,i}\) to successfully encode the correlations. Because \((\alpha _{2,3}-\alpha _{1,2})<(\alpha _{2,2}-\alpha _{1,2})\) from the initial ordering, the only requirement is that \((\alpha _{2,3}-\alpha _{1,2})<(\alpha _{3,3}-\alpha _{1,3})\), which is equivalent to
\[p_{3}<\frac{\beta_{1,2}}{\prod_{i<j,\,(i,j)\ne (1,2)}\beta_{i,j}}. \tag{27}\]
Repeating this derivation process for each of the eight cases given in Table 1 reveals that the term in the numerator on the right-hand side of Eq. (27) is the smallest \(\beta _{i,j}\) of the initial sequence, \(\beta _{i,j}^{(1)}\). Note that the denominator of Eq. (27) has been rewritten as a product series to avoid enumerating the three specific forms of the denominator that occur when the numerator is \(\beta _{1,2}, \beta _{1,3}\), or \(\beta _{2,3}\). It also follows that, for each of the eight cases, the subscript of \(p\) in Eq. (27) is equal to the third variable index \(k\) not appearing in \(\beta _{i,j}^{(1)}\). These two generalizations lead to the bound provided in Eq. (17).
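The generalized condition described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the Park et al. (1996) form \(\beta_{i,j}=1+\rho_{i,j}\sqrt{q_iq_j/(p_ip_j)}\) with \(q_i=1-p_i\), uses 0-based indices, and the names `beta` and `satisfies_bound` are hypothetical.

```python
import math

def beta(p_i, p_j, rho_ij):
    """Assumed form: beta_ij = exp(alpha_ij) = 1 + rho_ij * sqrt(q_i q_j / (p_i p_j))."""
    return 1.0 + rho_ij * math.sqrt((1 - p_i) * (1 - p_j) / (p_i * p_j))

def satisfies_bound(p, rho):
    """Check p_k < beta_min / (product of the other two betas).

    p   : sequence of three success probabilities (0-based indices)
    rho : dict mapping the pairs (0, 1), (0, 2), (1, 2) to correlations
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    b = {ij: beta(p[ij[0]], p[ij[1]], rho[ij]) for ij in pairs}
    smallest = min(pairs, key=lambda ij: b[ij])      # pair holding the smallest beta
    k = ({0, 1, 2} - set(smallest)).pop()            # index absent from that pair
    others = math.prod(b[ij] for ij in pairs if ij != smallest)
    return p[k] < b[smallest] / others
```

For instance, `satisfies_bound((0.9, 0.8, 0.7), {(0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.1})` evaluates the condition for three mildly correlated versions; the input values are illustrative only.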
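To complete the proof, assume that Eq. (27) is satisfied and subtract \((\alpha _{2,3}-\alpha _{1,2})\) from \(\alpha _{2,2}(2), \alpha _{2,3}(2)\), and \(\alpha _{3,3}(2)\), producing
\[\varvec{\alpha}(3)=\begin{pmatrix}\alpha_{1,1}-\alpha_{1,3} & 0 & 0\\ 0 & \alpha_{2,2}-\alpha_{2,3} & 0\\ 0 & 0 & \alpha_{3,3}-\alpha_{1,3}-\alpha_{2,3}+\alpha_{1,2}\end{pmatrix}. \tag{28}\]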
Now the on-diagonal elements can be eliminated in isolation, so do this in the order \(\alpha _{1,1}(3), \alpha _{2,2}(3), \alpha _{3,3}(3)\).
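The elimination steps can be sketched in code. This is an illustrative implementation under stated assumptions, not the authors' program: it assumes \(\alpha_{i,j}=\log\bigl(1+\rho_{i,j}\sqrt{q_iq_j/(p_ip_j)}\bigr)\) with \(\rho_{i,i}=1\), nonnegative correlations, and 0-based indices. It always removes the smallest remaining positive entry, which may visit the diagonal entries in a different order than the text but yields the same rates. The name `poisson_encoding` is hypothetical.

```python
import math

def poisson_encoding(p, rho, tol=1e-12):
    """Return a list of (rate, subset) pairs of independent Poisson variables.

    p   : list of success probabilities
    rho : symmetric correlation matrix (nested lists), nonnegative entries
    """
    n = len(p)
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = 1.0 if i == j else rho[i][j]
            ratio = (1 - p[i]) * (1 - p[j]) / (p[i] * p[j])
            alpha[i][j] = math.log(1.0 + r * math.sqrt(ratio))

    terms = []
    while True:
        positive = [(alpha[i][j], i, j)
                    for i in range(n) for j in range(i, n) if alpha[i][j] > tol]
        if not positive:
            break
        rate, r, s = min(positive)                   # smallest remaining entry
        if alpha[r][r] <= tol or alpha[s][s] <= tol:
            # A diagonal entry vanished before an off-diagonal one.
            raise ValueError("correlations cannot be encoded for these inputs")
        subset = {r, s}
        for k in range(n):                           # grow the subset greedily
            if k not in subset and all(alpha[k][m] > tol for m in subset | {k}):
                subset.add(k)
        for i in subset:
            for j in subset:
                alpha[i][j] -= rate
        terms.append((rate, frozenset(subset)))
    return terms
```

Each returned pair gives the rate of one independent Poisson variable and the set of Bernoulli variables it is shared by; summing the rates over all pairs containing index `i` recovers \(-\log p_i\).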
Table 3 shows the symbolic value subtracted in each iteration. These values are the rate parameters of independent Poisson random variables \(Y_{l}, 1\le l\le 6\). The sets indicate the subset of correlated Bernoulli variables to which these independent Poisson variables belong.
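Thus, the expressions for the success of the three correlated Bernoulli variables are
\[Z_{1}=I_{\{0\}}(Y_{1}+Y_{2}+Y_{4}),\qquad Z_{2}=I_{\{0\}}(Y_{1}+Y_{3}+Y_{5}),\qquad Z_{3}=I_{\{0\}}(Y_{1}+Y_{2}+Y_{3}+Y_{6}).\]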
Here, the indicator function \(I_{\{0\}}(\cdot )=1\) if the sum of the outcomes of the \(Y_{l}\) in \(Z_{i}\) equals zero. Hence, \(Z_{1}=1\) if and only if \(Y_{1}=Y_{2}=Y_{4}=0\); \(Z_{2}=1\) if and only if \(Y_{1}=Y_{3}=Y_{5}=0\); and \(Z_{3}=1\) if and only if \(Y_{1}=Y_{2}=Y_{3}=Y_{6}=0\). Table 4 shows the probability that variable \(Y_{l}=0\), which is defined as \(m_{l}:=Pr\{Y_{l}=0\}=\exp (-\lambda _{l})\).
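The construction above translates directly into a sampler. The sketch below is illustrative only: the membership table follows the if-and-only-if conditions just stated, the Poisson draws use Knuth's multiplication method, and the rates in `EXAMPLE_RATES` are made-up values, not taken from the paper. All names are hypothetical.

```python
import math
import random

# Which independent Poisson variables Y_l enter each Z_i (Case I, per the text).
MEMBERS = {1: (1, 2, 4), 2: (1, 3, 5), 3: (1, 2, 3, 6)}

# Hypothetical rates lambda_l, chosen only for illustration.
EXAMPLE_RATES = {1: 0.02, 2: 0.01, 3: 0.02, 4: 0.08, 5: 0.19, 6: 0.32}

def knuth_poisson(lam, rng):
    """Draw one Poisson(lam) variate by multiplying uniforms until below exp(-lam)."""
    threshold = math.exp(-lam)
    count, running = 0, rng.random()
    while running > threshold:
        count += 1
        running *= rng.random()
    return count

def sample_z(rates, rng):
    """Return one correlated outcome (Z1, Z2, Z3), with Z_i = 1 iff its Y_l sum to zero."""
    y = {l: knuth_poisson(rates[l], rng) for l in range(1, 7)}
    return tuple(1 if sum(y[l] for l in MEMBERS[i]) == 0 else 0 for i in (1, 2, 3))
```

With these rates, the marginal success probability of \(Z_1\) is \(m_1m_2m_4=\exp(-0.11)\), and similarly for \(Z_2\) and \(Z_3\); averaging many calls to `sample_z(EXAMPLE_RATES, random.Random(seed))` should approach those values.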
The only outcome that contributes to the correlated outcome \(E[Z_{1}Z_{2}Z_{3}]\), where all three experiments are successful, is the uncorrelated outcome where all six independent Poisson variables are zero. Thus, the probability of this outcome is
\[E[Z_{1}Z_{2}Z_{3}]=\prod_{l=1}^{6}m_{l}=\frac{\prod_{i=1}^{3}p_{i}\prod_{i<j}\beta_{i,j}}{\beta_{1,2}}. \tag{29}\]
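The remaining seven outcomes are easily obtained from Eqs. (29) and (12) by taking expectations. For example,
\[E[Z_{1}Z_{2}\overline{Z}_{3}]=E[Z_{1}Z_{2}]-E[Z_{1}Z_{2}Z_{3}]=p_{1}p_{2}\beta_{1,2}-\frac{\prod_{i=1}^{3}p_{i}\prod_{i<j}\beta_{i,j}}{\beta_{1,2}}, \tag{30}\]
and the cases \(E[Z_{1}\overline{Z}_{2}Z_{3}]\) and \(E[\overline{Z}_{1}Z_{2}Z_{3}]\) are symmetric to Eq. (30). Similarly,
\[E[\overline{Z}_{1}\overline{Z}_{2}Z_{3}]=p_{3}-p_{1}p_{3}\beta_{1,3}-p_{2}p_{3}\beta_{2,3}+\frac{\prod_{i=1}^{3}p_{i}\prod_{i<j}\beta_{i,j}}{\beta_{1,2}}, \tag{31}\]
and the cases \(E[Z_{1}\overline{Z}_{2}\overline{Z}_{3}]\) and \(E[\overline{Z}_{1}Z_{2}\overline{Z}_{3}]\) are symmetric to Eq. (31). The final outcome, where all three experiments result in failure, is
\[E[\overline{Z}_{1}\overline{Z}_{2}\overline{Z}_{3}]=1-\sum_{i=1}^{3}p_{i}+\sum_{i<j}p_{i}p_{j}\beta_{i,j}-\frac{\prod_{i=1}^{3}p_{i}\prod_{i<j}\beta_{i,j}}{\beta_{1,2}}. \tag{32}\]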
Repeating this derivation for Cases II–VIII, given in Table 1, reveals that the denominator \(\beta _{1,2}\) in terms of the form \(\frac{\prod _{i=1}^{3} p_{i}\prod _{i<j}\beta _{i,j}}{\beta _{1,2}}\) in Eqs. (29)–(32) generalizes to \(\beta ^{(1)}_{i,j}\). This completes the proof. \(\square \)
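As a rough illustration of how the eight outcome probabilities might be assembled and used, the sketch below builds the joint distribution by inclusion-exclusion and evaluates the three systems in \(\varGamma\). It is not the authors' implementation. It assumes \(E[Z_iZ_j]=p_ip_j\beta_{i,j}\), the Park et al. (1996) form of \(\beta_{i,j}\), that the smallest \(\beta_{i,j}\) plays the role of \(\beta^{(1)}_{i,j}\), and that the bound of the theorem holds for the inputs. The names `tvb_pmf` and `reliabilities` are hypothetical, and indices are 0-based.

```python
import math

def tvb_pmf(p, rho):
    """Joint pmf of (Z1, Z2, Z3) as a dict keyed by outcome tuples.

    p   : three success probabilities
    rho : dict mapping the pairs (0, 1), (0, 2), (1, 2) to correlations
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    b = {(i, j): 1.0 + rho[(i, j)]
         * math.sqrt((1 - p[i]) * (1 - p[j]) / (p[i] * p[j])) for i, j in pairs}
    e2 = {(i, j): p[i] * p[j] * b[(i, j)] for i, j in pairs}     # E[Z_i Z_j]
    e3 = math.prod(p) * math.prod(b.values()) / min(b.values())  # E[Z1 Z2 Z3]
    return {
        (1, 1, 1): e3,
        (1, 1, 0): e2[(0, 1)] - e3,
        (1, 0, 1): e2[(0, 2)] - e3,
        (0, 1, 1): e2[(1, 2)] - e3,
        (1, 0, 0): p[0] - e2[(0, 1)] - e2[(0, 2)] + e3,
        (0, 1, 0): p[1] - e2[(0, 1)] - e2[(1, 2)] + e3,
        (0, 0, 1): p[2] - e2[(0, 2)] - e2[(1, 2)] + e3,
        (0, 0, 0): 1.0 - sum(p) + sum(e2.values()) - e3,
    }

def reliabilities(pmf):
    """Reliability of the series, parallel, and two-out-of-three systems."""
    return {
        "series": pmf[(1, 1, 1)],                    # all three versions succeed
        "parallel": 1.0 - pmf[(0, 0, 0)],            # at least one succeeds
        "two out of three": sum(pr for z, pr in pmf.items() if sum(z) >= 2),
    }
```

Calling `reliabilities(tvb_pmf((0.9, 0.8, 0.7), {(0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.1}))` returns the three system reliabilities for one illustrative parameter choice; comparing the two-out-of-three value with `max(p)` reproduces the kind of comparison described in the abstract.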
Cite this article
Fiondella, L., Zeephongsekul, P. Trivariate Bernoulli distribution with application to software fault tolerance. Ann Oper Res 244, 241–255 (2016). https://doi.org/10.1007/s10479-015-1798-4
DOI: https://doi.org/10.1007/s10479-015-1798-4