Quantifying reliability uncertainty from catastrophic and margin defects: A proof of concept

https://doi.org/10.1016/j.ress.2010.10.006

Abstract

We aim to analyze the effects of component level reliability data, including both catastrophic failures and margin failures, on system level reliability. While much work has been done to analyze margins and uncertainties at the component level, a gap exists in relating this component level analysis to the system level. We apply methodologies for aggregating uncertainty from component level data to quantify overall system uncertainty. We explore three approaches toward this goal: the classical Method of Moments (MOM), Bayesian, and Bootstrap methods. These three approaches are used to quantify the uncertainty in reliability for a system of mixed series and parallel components for which both pass/fail and continuous margin data are available. This paper provides proof of concept that uncertainty quantification methods can be constructed and applied to system reliability problems. In addition, application of these methods demonstrates that the results from the three fundamentally different approaches can be quite comparable.

Highlights

• We develop three distinct methods for quantifying reliability uncertainty for complex systems.
• We illustrate our three methods using a system of mixed series and parallel components.
• We show that the methods (Method of Moments, Bayesian, and Bootstrap) are comparable.
• We provide a proof of concept for quantifying uncertainty about system reliability.

Introduction

Quantification of Margins and Uncertainties (QMU) analysis is an approach to computing and expressing component margin relative to a requirement, or relative to the threshold of a connecting component. A QMU analysis may be as simple as a histogram of a component's performance output shown relative to a performance requirement, with the margin expressed in standard deviations from the mean. In more complex applications, such as components that have an age trend in their performance data, the QMU analysis could involve a linear regression against component age, along with uncertainty bands around the regression line. In such QMU analyses we are generally interested in making an end-of-life prediction.

When executing a QMU analysis we can express a component's capability in terms of K factors. K is defined as margin/uncertainty, or more explicitly as

K = (upper threshold − sample mean) / (sample standard deviation) for an upper threshold; or
K = (sample mean − lower threshold) / (sample standard deviation) for a lower threshold.

K provides a common measure of margin. Given a distribution function fitted to the data, K also correlates to a fraction defective (that is, the fraction that fails to meet the requirement). If the data are normally distributed (a big if), K factors of 3 or greater are considered adequate. Because the data are often not normally distributed, we instead examine the expected fraction defective when evaluating the adequacy of the margin.
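
To make the computation concrete, here is a minimal sketch in Python (ours, not the paper's code; the data values, threshold, and function names are illustrative assumptions) of estimating K from sample data and the fraction defective implied by a normal fit:

```python
# Illustrative sketch: K factor and the fraction defective implied by
# a normality assumption. All values below are made up for illustration.
import numpy as np
from scipy.stats import norm

def k_factor(data, threshold, upper=True):
    """K = margin / uncertainty against a single fixed threshold."""
    mean, sd = np.mean(data), np.std(data, ddof=1)
    return (threshold - mean) / sd if upper else (mean - threshold) / sd

def fraction_defective_normal(k):
    """Expected fraction failing the requirement if the data were normal."""
    return norm.sf(k)  # upper-tail probability K standard deviations out

data = np.random.default_rng(1).normal(10.0, 0.5, size=30)  # hypothetical data
k = k_factor(data, threshold=12.0, upper=True)
print(k, fraction_defective_normal(k))  # K near 4 implies ~3e-5 defective
```

For non-normal data one would replace norm.sf with the tail probability of whatever distribution is fitted, which is exactly why the expected fraction defective is the more robust figure of merit.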

A QMU analysis is based upon critical performance variables, using data taken during lot acceptance, surveillance, flight testing, and other sources. A typical QMU analysis spends most of the initial resources sorting through these data and teasing out significant factors that may influence them, such as changes to testers over the period during which the data were taken, trends during production, test conditions, and lot-to-lot variation.

Computing the margin (defined as data mean minus requirement) requires some definition of the requirement. To date this has generally meant using a published requirement in an interface definition or product acceptance specification. However, we have lately advanced toward calculating joint probabilities between the performance distribution and the threshold distribution of two connecting components in order to correctly understand the true margin between them.
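
As a hedged illustration of that joint-probability calculation (our sketch; the distributions and parameter values are assumptions, not data from the paper), if both the upstream performance and the downstream threshold are modeled as independent normals, the probability that performance clears the threshold has a closed form, and Monte Carlo handles non-normal fits:

```python
# Illustrative sketch: "true margin" between two connecting components when
# both the performance and the threshold are distributions. Values made up.
import numpy as np
from scipy.stats import norm

mu_p, sd_p = 10.0, 0.5   # upstream performance distribution (assumed)
mu_t, sd_t = 8.5, 0.4    # downstream threshold distribution (assumed)

# For independent normals, the margin M = P - T is itself normal, so
# P(performance > threshold) has a closed form:
p_success = norm.cdf((mu_p - mu_t) / np.hypot(sd_p, sd_t))

# Monte Carlo gives the same answer and extends to non-normal fits:
rng = np.random.default_rng(0)
p_mc = np.mean(rng.normal(mu_p, sd_p, 100_000) > rng.normal(mu_t, sd_t, 100_000))
print(p_success, p_mc)
```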

To date, the Nuclear Weapons complex has conducted many component level QMU analyses [1], [2], [3], [4], [5] but has not yet developed ways to integrate these component level analyses into a framework for making system level inferences. What impact does a component's low margin have on overall system reliability and uncertainty? Being able to answer such a question allows us to leverage QMU assessments to improve system level decisions, as well as assess the impact of low component margins on overall system level reliability and uncertainty. We can do this by integrating margin insufficiency failure modes with quality defect failure modes in a common model. By computing the overall system uncertainty we obtain a measure of “confidence” in the system that we can use as a decision tool in evaluating different system projects and in identifying where additional testing resources should be allocated.

Nuclear weapons are designed with multiple objectives that include safety, security, and reliability. We limit our focus to reliability, defined as the probability of success of the weapon performing its intended function at the intended time given the required temperature range, shock and vibration exposures, altitude and speed of the release envelope and over the designed lifetime of the weapon. The goal of the Department of Energy (DOE)/National Nuclear Security Administration (NNSA) weapon reliability assessment process is to provide a quantitative metric for this assessment. More details about this process are given in Wright and Bierbaum [6].

We typically use both system level and component level test data in computing the system reliability, using a series/parallel model of the critical events to compute the overall probability of success. Component level failure modes (and data) lie behind each of the critical events. Thus several component level failure modes are combined using series/parallel models to compute the probability of an event occurring successfully. Component level failure modes are computed using failure data; thus the reliability of an event is some series/parallel combination of component reliabilities. In equation form, for two components joined in series for a particular event, we have

R_event = R_component1 × R_component2 = (1 − Prob(component 1 failure)) × (1 − Prob(component 2 failure))
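
The series rule above, and its parallel (redundant) counterpart, can be captured in two small helpers; this is a generic sketch of the standard series/parallel algebra, with made-up failure probabilities rather than the paper's data:

```python
# Generic series/parallel reliability algebra (standard formulas; the
# numeric failure probabilities below are made up for illustration).
import math

def series(*reliabilities):
    """All components must succeed: R = product of R_i."""
    return math.prod(reliabilities)

def parallel(*reliabilities):
    """Any one component suffices: R = 1 - product of (1 - R_i)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Two components in series for one event, as in the equation above:
r_event = series(1 - 0.02, 1 - 0.05)   # component failure probs 0.02, 0.05
print(r_event)                          # 0.931
```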

This paper demonstrates a proof of concept of three methods for estimating system level reliability and uncertainty using component-level data. The data used in the analyses presented here consist of both catastrophic failures (pass/fail data) and margin failures (continuous data). The analyses are applied to a relatively simple system model consisting of mixed series and parallel components. Three diverse methods (Method of Moments, Bayesian, and Bootstrap) are used, and the results from each are compared to the NNSA system reliability point estimate approach [7]. The example system model captures some key features of the top-level models used by Sandia National Laboratories and Los Alamos National Laboratory to assess weapon reliability. While this example uses a two-level model (events and component failure modes), all three approaches can be applied to more complex systems (though with varying degrees of difficulty) and are not restricted to two levels.

The NNSA approach, described in Section 2, is currently the standard method used for reporting point estimates for system reliability. Historically, point reliability estimates have sufficed to support military operational planning. The NNSA Surveillance Program has historically focused on detecting quality defects early in the life of the stockpile rather than uncertainty measures associated with the known failure modes and existing data.

With a mature stockpile, limited production opportunities, and a now-extensive surveillance database, understanding the residual uncertainty associated with known and measurable failure modes has grown in importance. A key feature of QMU reliability analyses is to focus not only on a point estimate of reliability, but also on the uncertainty associated with the estimate. Understanding uncertainty can contribute to subsequent decision-making. Hence the three methods presented here seek to complement the NNSA point reliability estimate and to provide a unified mechanism for assessing system-level reliability (point estimates) and reliability uncertainty (interval estimates) associated with component-level catastrophic failures and margin failures. Each method can measure the contribution of each failure source to the overall system reliability uncertainty. Each method can also be applied to a variety of system structures, including mixed series/parallel systems.

The first method that complements the NNSA point estimate by adding appropriate uncertainty intervals is a classical Method of Moments (MOM) approach. (See Kotz et al. [8] for a summary discussion of the Method of Moments and an extensive list of historical and more recent references.) This approach captures and aggregates sampling uncertainties at the component level. The method evaluates the mean and variance of the various component-level reliability estimators and then propagates these through the system level reliability equation. Simplifying assumptions are made about the distributional form of the system reliability estimator to obtain approximate values for its mean and variance. We then construct a 90% confidence interval for the system reliability using an equivalent Binomial distribution.
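
A compact sketch of the MOM recipe (our reconstruction of the idea, not the paper's code; the component data are invented and we assume an all-series structure for brevity) propagates the estimator moments through the product and then matches an equivalent binomial for Clopper-Pearson-style 90% limits:

```python
# MOM-style sketch: propagate component estimator moments through a series
# system, then match an "equivalent binomial" for a 90% interval.
# All data below are invented; real systems need the full series/parallel map.
import numpy as np
from scipy.stats import beta

def product_moments(means, variances):
    """Mean and variance of a product of independent estimators:
    E[prod X_i] = prod E[X_i];  E[(prod X_i)^2] = prod(Var_i + E[X_i]^2)."""
    mean = np.prod(means)
    second_moment = np.prod(np.asarray(variances) + np.square(means))
    return mean, second_moment - mean**2

def equivalent_binomial_interval(mean, var, conf=0.90):
    """Pick an effective sample size with matching variance, then apply
    (continuous-parameter) Clopper-Pearson limits."""
    n = mean * (1.0 - mean) / var      # effective number of system "tests"
    x = mean * n                        # effective number of successes
    a = (1.0 - conf) / 2.0
    return beta.ppf(a, x, n - x + 1), beta.ppf(1 - a, x + 1, n - x)

s, n = np.array([48, 29, 95]), np.array([50, 30, 100])  # successes / trials
means = s / n
variances = means * (1 - means) / n     # binomial estimator variances
print(equivalent_binomial_interval(*product_moments(means, variances)))
```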

The second method presented applies some of the modern computing-intensive approaches currently available. It is a Bayesian approach using Markov Chain Monte Carlo (MCMC) methods to develop the system level distribution for reliability. This approach selects a user-specified diffuse prior distribution for each component, which is then updated based upon the data to develop the posterior distribution. By this mechanism, both sampling and other knowledge uncertainties can be captured, although the specific example here emphasizes the former. The individual component level reliability estimates are then combined to obtain the system level reliability estimate with an associated uncertainty propagated from the component distributions. We then construct a 90% credible interval for system reliability using the computed system-level reliability distribution.
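
For pure pass/fail data the Bayesian machinery can be illustrated without full MCMC, since a Beta prior is conjugate to binomial data; the sketch below (our illustration with invented counts and an assumed all-series structure, whereas the paper's mixed pass/fail and margin data require MCMC in tools such as R or WinBUGS) shows the update-then-propagate pattern and the resulting 90% credible interval:

```python
# Bayesian sketch: conjugate Beta updates per component, then Monte Carlo
# propagation of posterior draws through the system equation.
# Counts and the all-series structure are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(42)
draws = 100_000

data = {"J1": (48, 50), "J2": (29, 30), "J3": (95, 100)}  # (successes, trials)

# Diffuse Beta(0.5, 0.5) prior -> Beta(0.5 + s, 0.5 + n - s) posterior:
posteriors = {
    name: rng.beta(0.5 + s, 0.5 + n - s, size=draws)
    for name, (s, n) in data.items()
}

# Propagate the draws through the (assumed all-series) system equation and
# read a 90% credible interval off the system-level distribution:
r_sys = posteriors["J1"] * posteriors["J2"] * posteriors["J3"]
print(r_sys.mean(), np.percentile(r_sys, [5, 95]))
```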

The third method also takes advantage of modern computing power and considers resampling through a Bootstrap approach. It re-samples the available data with replacement to develop the non-parametric component and system level probability distributions. The resulting system-level distribution is used to construct a 90% Bootstrap confidence interval for system reliability. This method employs the fewest assumptions of the three approaches.
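
The resampling loop itself is short; this sketch (again with invented pass/fail records and an assumed all-series structure) resamples each component's raw data with replacement, pushes each resample through the system equation, and reads off percentile limits:

```python
# Bootstrap sketch: resample raw component records with replacement,
# recompute system reliability per resample, take percentile limits.
# Records and the all-series structure are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
B = 10_000  # number of bootstrap resamples

records = {  # raw pass/fail data per component (1 = success)
    "J1": np.r_[np.ones(48), np.zeros(2)],
    "J2": np.r_[np.ones(29), np.zeros(1)],
    "J3": np.r_[np.ones(95), np.zeros(5)],
}

r_sys = np.ones(B)
for rec in records.values():
    resamples = rng.choice(rec, size=(B, rec.size), replace=True)
    r_sys *= resamples.mean(axis=1)  # component reliability per resample

print(np.percentile(r_sys, [5, 95]))  # percentile Bootstrap 90% interval
```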

All three approaches begin with a small number of basic principles but then require a fair amount of mathematical machinery to execute. All three can be complex, depending on the complexity of the system under study. The MOM approach involves substantial analytic manipulation of the equations. The Bayesian approach is computationally intensive and relies on MCMC simulation, with tools becoming more readily available in software packages such as R and WinBUGS. The Bootstrap approach is also computationally intensive, but can be programmed in readily available software such as Minitab or MATLAB.

Section snippets

System model, data, and reliability block diagram

We define an illustrative example with an appropriate degree of resolution and data. The reliability block diagram for such a system is shown in Fig. 1. Each J and K block represents an event, which must be successfully completed for the system to perform as required.

Fig. 1 shows a simple system composed of four events. The probability of success of each event (R_i) is equivalent to the probability of the required components operating successfully (no catastrophic failure) and providing…

Classical approach for propagating uncertainties using method of moments

The first approach develops a methodology for estimating potential errors in weapon reliability estimates that is based on classical statistical theory. We create an approximate confidence interval for the system reliability estimate, R_Sys, by approximating its mean and variance from the moments of the random variables entering into it: R_J1c, R_J1m, R_J2c, R_J2m, R_J3c, and the various function and margin variables that enter into R_K1. We assume an underlying asymmetric distribution for R_Sys from…

Bayesian approach for aggregating uncertainties

Bayesian inference provides a means for estimating reliability that is especially useful for combining component level data into a system reliability prediction in the presence of small sample sizes and different data types. We discuss the approach for our particular problem and refer the reader to Hoff [13] or Gelman et al. [14] for further details on Bayesian statistics. Bayesian methods applied directly to the subject area of reliability are detailed in many texts, including Martz and Waller…

Bootstrap approach

In this section, the bootstrap [20] is used to determine confidence limits for system reliability. We refer to these limits simply as “Bootstrap Confidence Limits” or “Bootstrap Confidence Intervals.” In the context of the problem we are addressing, there are four different reliability distributions (one each for J1, J2, J3, and K1) that are estimated using the bootstrap. The four component level distributions are then combined via the system level reliability equation (Eq. (1)). As noted earlier, each of the…

Comparison of approaches and future work

The three approaches presented here each provide a mechanism for assessing system-level reliability and reliability uncertainty associated with component level catastrophic failure and margin data. Each provides a means to measure and thus compare the system-level uncertainty contribution from the individual sources. For ease of presentation, the approaches have been demonstrated with a simple mixed series/parallel system. Real-world systems can be considerably more complex, and may involve…

References (23)

  • Eardley D, et al. Quantification of margins and uncertainties. JASON Report JSR-04-330, MITRE Corporation; 2005.
  • Helton JC. Conceptual and computational basis for the quantification of margins and uncertainty. Sandia National…
  • Diegert K, Klenke S, Novotny G, Paulsen R, Pilch M, Trucano T. Toward a more rigorous application of margins and…
  • Ahearne J, et al. Evaluation of quantification of margins and uncertainties methodology for assessing and certifying…
  • Sharp D, et al. QMU and nuclear weapons certification: what's under the hood? Los Alamos Science; 2003.
  • Wright DL, Bierbaum RL. Nuclear weapon reliability evaluation methodology. SAND2002-8133; 2002. Available online at…
  • Bierbaum RL, Cashen JJ, Kerschen TJ, Sjulim JM, Wright DL. DOE nuclear weapon reliability definition: history,…
  • Taylor BN, Kuyatt CE. Guidelines for evaluating and expressing the uncertainty of NIST measurement results,…
  • Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika; 1934.
  • Johnson NL, et al. Discrete Distributions; 1969.

1. This work was performed under the auspices of the Los Alamos National Laboratory, operated by the University of California for the United States Department of Energy under contract W-7405-ENG-36.

2. Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
