Propagated failure analysis for non-repairable systems considering both global and selective effects
Introduction
The complete failure state of a component can be classified as either a local failure or a propagated failure. A local failure causes only the outage of the failed component and has no effect on other system components, whereas a propagated failure, usually caused by imperfect fault coverage or a destructive effect, causes not only the outage of the component itself but also damage to other system components [1]. Imperfect fault coverage, which has been studied intensively, results from the malfunction of the system's automatic fault detection/recovery mechanism [2], [3], [4]. Take a three-component system sharing a workload as an example: when one of the three components fails, the automatic recovery mechanism should detect the failed component and redistribute the workload to the remaining two components. However, the automatic recovery mechanism could itself fail, so that the component failure remains uncovered and part of the work is left uncompleted. Thus the entire system fails. Another cause of propagated failure is the destructive effect (e.g., explosion, blackout, overheating, or voltage surge) of a failing system component on other system components [5].
Propagated failures can be further classified into propagated failures with global effect (PFGE) and propagated failures with selective effect (PFSE) according to the scope of the affected components [1], [6]. A PFGE originating from one component propagates through the entire system and damages all system components, whereas a PFSE affects only a subset of the remaining system components. Consider a system with three components working in parallel. When one component fails, the system can continue working, although overall system performance degrades. However, the failure of the component might cause overheating and thus an explosion, which destroys other components. If the explosion damages one of the remaining two components, the system keeps working with the one remaining component; such a failure can be considered a PFSE. But if the explosion affects all of the remaining components, the entire system fails; such a failure can be considered a PFGE. Fig. 1 shows the classification of the component failure state.
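The three-component parallel example can be sketched in a few lines of Python. This is only an illustration of the classification, not part of the proposed method; the names `FailureKind`, `surviving_components`, and `parallel_system_works` and the component labels `A`, `B`, `C` are hypothetical.

```python
from enum import Enum


class FailureKind(Enum):
    LOCAL = "local"  # only the failing component goes down
    PFSE = "pfse"    # selective effect: a given subset is also damaged
    PFGE = "pfge"    # global effect: every component is damaged


def surviving_components(components, source, kind, affected=frozenset()):
    """Return the components still working after `source` fails.

    `affected` is only used for PFSE and lists the other components
    that the propagated failure damages (an illustrative input).
    """
    components = set(components)
    if kind is FailureKind.PFGE:
        return set()
    failed = {source}
    if kind is FailureKind.PFSE:
        failed |= set(affected)
    return components - failed


def parallel_system_works(survivors):
    # A parallel structure works while at least one component survives.
    return len(survivors) > 0


system = {"A", "B", "C"}
after_pfse = surviving_components(system, "A", FailureKind.PFSE, {"B"})
after_pfge = surviving_components(system, "A", FailureKind.PFGE)
print(parallel_system_works(after_pfse), parallel_system_works(after_pfge))
```

With the explosion reaching only `B`, component `C` keeps the parallel system alive; with a global effect, no component survives.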
The global/selective effect of a propagated failure, however, can be isolated in a system with functional dependence behavior. As formally described in [7] and later studied in [8], [9], [15], [16], functional dependence occurs when the failure of one component (referred to as a trigger component) causes other components (referred to as dependent components) within the same system to become unusable or inaccessible. In dynamic fault tree analysis [7], a special gate called the FDEP gate is used to model functional dependence behavior, as illustrated in Section 3. In systems with functional dependence, the failure of a trigger component can isolate a propagated failure that originates from any of its dependent components. However, the failure isolation effect takes place only when the trigger component fails before any propagated failure originating from the dependent components occurs. In other words, the failure propagation and failure isolation events compete in the time domain [1]. For example, in computer networks, communication among computers is achieved through Network Interface Cards (NICs). In this case, the NIC is the trigger and the connected computer is the dependent component. When the NIC fails before any failure of the connected computer propagates through the network, it not only makes the connected computer inaccessible but also makes the network insensitive to any failure of that computer. However, if the propagated failure from a dependent component occurs before the NIC fails, the failure isolation effect does not take place and the propagated failure can cause the entire network to break down.
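The time-domain competition can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the trigger failure time and the propagated-failure time of one dependent component are independent and exponentially distributed; the rates below are made-up values, and the function names are hypothetical. Under these assumptions, the probability that the trigger fails first within mission time t (so that isolation takes place) is the integral of f_T(τ)(1 − F_P(τ)) over [0, t], which has a closed form.

```python
import math


def p_isolation_closed_form(rate_trigger, rate_pf, t):
    """P(trigger fails by time t and before the propagated failure),
    for independent exponential failure times (an assumption)."""
    total = rate_trigger + rate_pf
    return rate_trigger / total * (1.0 - math.exp(-total * t))


def p_isolation_numeric(rate_trigger, rate_pf, t, steps=100_000):
    """Same probability by midpoint-rule integration of
    f_T(tau) * (1 - F_P(tau)) over [0, t]."""
    h = t / steps
    acc = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        density_trigger = rate_trigger * math.exp(-rate_trigger * tau)
        survival_pf = math.exp(-rate_pf * tau)
        acc += density_trigger * survival_pf
    return acc * h


# Hypothetical rates (per hour) and mission time (hours).
lam_trigger, lam_pf, mission = 1e-4, 5e-5, 1000.0
print(p_isolation_closed_form(lam_trigger, lam_pf, mission))
print(p_isolation_numeric(lam_trigger, lam_pf, mission))
```

The numeric integral is included only as a cross-check of the closed form; for non-exponential distributions the same integral applies with the corresponding density and distribution function.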
The reliability of systems subject to PFGE and the failure isolation effect has recently been studied for both binary systems [1] and multi-state systems [5], [10]. PFSE has also recently been studied for multi-state systems [6]. However, the algorithm in [6] can only be applied to series-parallel systems and does not consider the failure isolation effect. To the best of our knowledge, no existing work considers PFGE and PFSE at the same time together with their competition with the failure isolation effect. In this paper, we develop an analytical, combinatorial method for analyzing the competing failures in the reliability analysis of non-repairable binary-state systems subject to failure propagation with both global and selective effects. The method is not limited to series-parallel systems. The following assumptions are made in our proposed method:
1. Different functional dependence groups are independent; i.e. different trigger elements cannot have the same dependent components.
2. Propagated failures from dependent components only cause local failures of other components.
3. Trigger components can only have independent failures; they cannot be affected by PFSE from other components.
4. The propagated failures can originate only from dependent components (relaxed in Section 5).
5. The failure of trigger components makes all the corresponding dependent components unavailable.
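Assumption 1 is easy to check mechanically before an analysis is run. The following is a minimal sketch, assuming the functional dependence groups are given as a mapping from each trigger to its dependent components; the function name `find_shared_dependents` and the labels `T1`, `T2`, `A`, `B`, `C` are hypothetical and do not come from the paper.

```python
def find_shared_dependents(fdep_groups):
    """Return (component, first_trigger, second_trigger) tuples for every
    dependent component that appears under more than one trigger.

    `fdep_groups` maps a trigger name to an iterable of dependent names.
    An empty result means the groups satisfy the disjointness assumption.
    """
    owner = {}
    conflicts = []
    for trigger, dependents in fdep_groups.items():
        for component in dependents:
            if component in owner and owner[component] != trigger:
                conflicts.append((component, owner[component], trigger))
            else:
                owner[component] = trigger
    return conflicts


# Hypothetical groups: "B" depends on both T1 and T2, violating assumption 1.
groups_ok = {"T1": ["A", "B"], "T2": ["C"]}
groups_bad = {"T1": ["A", "B"], "T2": ["B", "C"]}
print(find_shared_dependents(groups_ok))
print(find_shared_dependents(groups_bad))
```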
The remainder of the paper is organized as follows. Section 2 presents the proposed combinatorial approach for the reliability analysis of binary-state systems subject to competing failure isolation and failure propagation events with both global and selective effects. Section 3 gives an illustrative example and the detailed analysis of the example system using the proposed method. Section 4 verifies the method by comparing results of the example system obtained using the proposed approach and results obtained using a Markov-based method. The generalizations of the proposed method are described in Section 5. Conclusions and future work are given in Section 6.
The proposed combinatorial approach
The proposed combinatorial approach for the reliability analysis of binary-state systems subject to competing failure isolation and failure propagation events with both global and selective effects can be described by the following step-by-step procedure:
Step 1: Define two disjoint events representing states of the trigger component: R1—the trigger component is functioning; R2—the trigger component fails. Based on the total probability theorem, the system unreliability can be evaluated using:
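The equation itself is not reproduced in this snippet. As a generic illustration only, and assuming that R1 and R2 are complementary events at the mission time, the total probability theorem gives a decomposition of the following form; the symbol UR for the system unreliability is notation assumed here, not taken from the source.

```latex
\[
UR_{\text{system}}
  = \Pr(\text{system fails} \mid R_1)\,\Pr(R_1)
  + \Pr(\text{system fails} \mid R_2)\,\Pr(R_2),
\qquad \Pr(R_1) + \Pr(R_2) = 1 .
\]
```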
System description
Fig. 2 illustrates a memory system within a computer system. The memory system is composed of an embedded memory block and an external memory block labeled EMB. The embedded memory block is further composed of an independent memory module (MM), and two memory chips (MC1, MC2) that are accessed through a memory interface unit (MIU). In other words, the memory chips are functionally dependent on the MIU. The entire memory system fails when both the embedded memory block and EMB fail. The embedded
Verification using Markov method
To verify the correctness of the proposed method, we also perform a Markov analysis of the example system. Fig. 8 illustrates the compact state transition diagram of the Markov solution. Note that each node in the diagram is labeled with the corresponding state number and the set of events that can occur in that state. For example, in node/state 7, only event El can happen.
Solving the Markov model, we obtain the system unreliability as the probability of the system being in the failure state
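The Markov model of Fig. 8 is not reproduced in this snippet. As a generic illustration of how the probability of an absorbing failure state can be obtained from a continuous-time Markov chain, the sketch below uses uniformization on a stand-in model: two identical, independent components in parallel with a hypothetical constant failure rate. The solver choice, the function name `transient_probs`, and all numerical values are assumptions made here, not the authors' procedure.

```python
import math


def transient_probs(Q, p0, t, tol=1e-12, max_terms=10_000):
    """Transient state probabilities of a CTMC by uniformization.

    Q is the generator matrix (each row sums to zero), p0 the initial
    distribution. Suitable for modest values of max|Q[i][i]| * t only.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))
    if lam == 0.0 or t == 0.0:
        return list(p0)
    # Embedded discrete-time chain: P = I + Q / lam.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)  # Poisson probability of zero jumps
    vec = list(p0)
    result = [weight * v for v in vec]
    cumulative = weight
    k = 0
    while 1.0 - cumulative > tol and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k  # Poisson probability of k jumps
        result = [r + weight * v for r, v in zip(result, vec)]
        cumulative += weight
    return result


# Stand-in model: state 0 = both up, 1 = one up, 2 = system failed.
rate = 1e-3      # hypothetical failure rate per hour
mission = 1000.0  # hypothetical mission time in hours
Q = [[-2 * rate, 2 * rate, 0.0],
     [0.0, -rate, rate],
     [0.0, 0.0, 0.0]]
probs = transient_probs(Q, [1.0, 0.0, 0.0], mission)
print(probs[2])  # unreliability of the stand-in model
```

For this stand-in model the result can be checked against the closed form (1 − e^(−rate·t))², which is what makes it a convenient sanity test for a solver before applying it to a larger state space.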
Generalization of the proposed method
As discussed in Section 2, the proposed method assumes that the non-dependent components, including the trigger component, suffer only local failures (LF). In this section, we generalize the proposed method to allow the non-dependent components to have LF, PFGE and PFSE.
Conclusions and future work
This paper has proposed a combinatorial method for the reliability analysis of non-repairable binary-state systems subject to competing failure isolation and failure propagation events with both global and selective effects. The proposed method is not limited to series, parallel or series-parallel structures. However, it is not directly applicable to traditional network analysis where k-terminal reliabilities are used as reliability metrics [17]. As illustrated through the example analysis, based on
Acknowledgments
This work was supported in part by the US National Science Foundation under grant number 0832594. The authors thank the editor and the anonymous reviewers for their valuable comments, which helped us improve the quality of the paper.
References (19)
- et al. Combinatorial analysis of systems with competing failures subject to failure isolation and propagation effects. Reliability Engineering and System Safety (2010)
- et al. Multi-state systems with multi-fault coverage. Reliability Engineering and System Safety (2008)
- et al. Reliability and performance of multi-state systems with propagated failures having selective effect. Reliability Engineering and System Safety (2010)
- et al. Markov and Markov reward model transient analysis: an overview of numerical approaches. European Journal of Operational Research (1989)
- Statistical complexity of the power method for Markov chains. Journal of Complexity (1989)
- et al. Monte-Carlo simulation analysis of the effects on different system performance levels on the importance on multi-state components. Reliability Engineering & System Safety (2003)
- et al. A separable method for incorporating imperfect fault-coverage into combinatorial models. IEEE Transactions on Reliability (1999)
- et al. Analysis of generalized phased-mission system reliability, performance and sensitivity. IEEE Transactions on Reliability (2002)
- et al. Combinatorial algorithm for reliability analysis of multi-state systems with propagated failures and failure isolation effect. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans (2011)