Propagated failure analysis for non-repairable systems considering both global and selective effects

https://doi.org/10.1016/j.ress.2011.11.005

Abstract

This paper proposes an algorithm for the reliability analysis of non-repairable binary systems subject to competing failure propagation and failure isolation events with both global and selective failure effects. A propagated failure that originates from a system component causes extensive damage to the rest of the system. A global effect occurs when the propagated failure causes the entire system to fail, whereas a selective effect occurs when the propagated failure causes the failure of only a subset of system components. In both cases, the failure propagation that originates from some system components (referred to as dependent components) can be isolated because of functional dependence between the dependent components and a component that prevents the failure propagation (referred to as the trigger component), provided that the trigger component fails before the propagated failure occurs. Most existing studies focus on the analysis of propagated failures with global effect. However, in many cases, propagated failures affect only a subset of system components, not the entire system. Existing approaches for analyzing propagated failures with selective effect are limited to series-parallel systems. This paper proposes a combinatorial method for propagated failure analysis that considers both global and selective effects as well as their competition with failure isolation in the time domain. The proposed method is not limited to series-parallel systems and places no restriction on the type of time-to-failure distributions of the system components. The method is verified using a Markov-based method. An example of a computer memory system is analyzed to demonstrate the application of the proposed method.

Introduction

The complete failure state of a component can be classified as either local failure or propagated failure. The local failure of a component causes only the outage of that component and has no effect on other system components, whereas the propagated failure of a component, usually caused by imperfect fault coverage or a destructive effect, causes not only the outage of the component itself but also damage to other system components [1]. Imperfect fault coverage, which has been studied intensively, results from the malfunction of the system's automatic fault detection/recovery mechanism [2], [3], [4]. Take a three-component system sharing a workload as an example. When one of the three components fails, the automatic recovery mechanism should detect the failed component and redistribute the workload to the remaining two components. However, the automatic recovery mechanism could itself fail, so that the failure of the component remains uncovered and part of the work is left uncompleted; the entire system then fails. Another cause of propagated failure is the destructive effect (e.g., explosion, blackout, overheating, or voltage surge) that a failing component may have on other system components [5].

The propagated failure can be further classified into propagated failure with global effect (PFGE) and propagated failure with selective effect (PFSE) according to the scope of affected components [1], [6]. A PFGE from one component propagates through the entire system and causes damage to all system components, whereas a PFSE affects only part of the remaining system components. Consider a system having three components working in parallel. When one component fails, the system can continue working, although overall system performance degrades. However, the failure of the component might cause overheating and thus an explosion, which destroys other components. If the explosion damages only one of the remaining two components, the system keeps working with the one remaining component. Such a failure can be considered a PFSE. But if the explosion affects all of the remaining components, the entire system fails. Such a failure can be considered a PFGE. Fig. 1 shows the classification of the component failure state.

The global/selective effect of a propagated failure, however, can be isolated in a system with functional dependence behavior. As formally described in [7] and later studied in [8], [9], [15], [16], functional dependence occurs when the failure of one component (referred to as a trigger component) causes other components (referred to as dependent components) within the same system to become unusable or inaccessible. In dynamic fault tree analysis [7], a special gate called the FDEP gate has been used to model the functional dependence behavior, as illustrated in Section 3. In systems with functional dependence, the failure of a trigger component can isolate the propagated failure that originates from any of the dependent components. However, the failure isolation effect takes place only when the trigger component fails before any propagated failure originating from the dependent components occurs. In other words, there is competition between the failure propagation and failure isolation events in the time domain [1]. For example, in computer networks, communication among the computers is achieved through Network Interface Cards (NICs). In this case, the NIC is the trigger component and the connected computers are the dependent components. When the NIC fails before any failure from a connected computer propagates through the network, it not only makes the connected computer inaccessible but also makes the network insensitive to any failure of the connected computers. However, if the propagated failure from a dependent component occurs before the NIC fails, the failure isolation effect does not take place and the propagated failure can cause the entire network to break down.
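The time-domain competition described above can be illustrated numerically. The following is a minimal sketch, not the method of the paper: it assumes a single trigger and a single dependent component with statistically independent failure times, and it uses hypothetical exponential rates purely for illustration (the function name and all constants are assumptions). It evaluates the probability that the trigger fails within the mission time before any propagated failure has occurred, i.e. the integral of f_trigger(τ)·(1 − F_PF(τ)) over the mission interval, and cross-checks it against the closed form that holds in the exponential case.

```python
import math


def prob_isolation_before_propagation(trigger_pdf, pf_cdf, mission_time, steps=20000):
    """Pr(trigger fails at some tau <= mission_time and no propagated
    failure has occurred by tau), assuming independent failure times.

    Midpoint-rule integration of trigger_pdf(tau) * (1 - pf_cdf(tau)).
    """
    h = mission_time / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += trigger_pdf(tau) * (1.0 - pf_cdf(tau))
    return total * h


# Hypothetical values, chosen only for illustration (not from the paper).
LAMBDA_TRIGGER = 1e-4   # trigger (e.g., NIC) failure rate, per hour
LAMBDA_PF = 2e-4        # propagated-failure rate of the dependent component
MISSION_TIME = 1000.0   # hours


def trigger_pdf(t):
    return LAMBDA_TRIGGER * math.exp(-LAMBDA_TRIGGER * t)


def pf_cdf(t):
    return 1.0 - math.exp(-LAMBDA_PF * t)


numeric = prob_isolation_before_propagation(trigger_pdf, pf_cdf, MISSION_TIME)

# Closed form for the exponential case, used only as a cross-check.
rate_sum = LAMBDA_TRIGGER + LAMBDA_PF
closed_form = LAMBDA_TRIGGER / rate_sum * (1.0 - math.exp(-rate_sum * MISSION_TIME))

print(numeric, closed_form)
```

Because the integration routine takes an arbitrary density and distribution function, the same sketch can be reused with non-exponential time-to-failure distributions by swapping `trigger_pdf` and `pf_cdf`.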

Reliability of systems subject to PFGE and the failure isolation effect has been recently studied for both binary systems [1] and multi-state systems [5], [10]. The PFSE has also been recently studied for multi-state systems [6]. But the algorithm in [6] can only be applied to series-parallel systems and does not consider the failure isolation effect. To the best of our knowledge, no work has been done to consider PFGE and PFSE at the same time as well as their competition with the failure isolation effect. In this paper, we develop an analytical and combinatorial method for analyzing the competing failures in the reliability analysis of non-repairable binary-state systems subject to failure propagation with both global and selective effects. The method is not limited to series-parallel systems. The following assumptions are made in our proposed method:

  1. Different functional dependence groups are independent; i.e. different trigger elements cannot have the same dependent components.
  2. Propagated failures from dependent components only cause local failures of other components.
  3. Trigger components can only have independent failures; they cannot be affected by PFSE from other components.
  4. The propagated failures can originate only from dependent components (relaxed in Section 5).
  5. The failure of trigger components makes all the corresponding dependent components unavailable.

The remainder of the paper is organized as follows. Section 2 presents the proposed combinatorial approach for the reliability analysis of binary-state systems subject to competing failure isolation and failure propagation events with both global and selective effects. Section 3 gives an illustrative example and the detailed analysis of the example system using the proposed method. Section 4 verifies the method by comparing results of the example system obtained using the proposed approach and results obtained using a Markov-based method. The generalizations of the proposed method are described in Section 5. Conclusions and future work are given in Section 6.

The proposed combinatorial approach

The proposed combinatorial approach for the reliability analysis of binary systems subject to competing failure isolation and failure propagation events with both global and selective effects can be described by the following step-by-step procedure:

Step 1: Define two disjoint events representing states of the trigger component: R1, the trigger component is functioning; R2, the trigger component fails. Based on the total probability theorem, the system unreliability can be evaluated using: Pr(System
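The expression in Step 1 is cut off in this excerpt. For orientation only, the total probability theorem applied to the two disjoint and complementary events R1 and R2 has the generic form below. This is the standard theorem rather than a reconstruction of the authors' exact equation, and the event label "system fails" is an assumed name.

```latex
\Pr(\text{system fails})
  = \Pr(\text{system fails} \mid R_1)\,\Pr(R_1)
  + \Pr(\text{system fails} \mid R_2)\,\Pr(R_2),
\qquad \Pr(R_1) + \Pr(R_2) = 1 .
```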

System description

Fig. 2 illustrates a memory system within a computer system. The memory system is composed of an embedded memory block and an external memory block labeled EMB. The embedded memory block is further composed of an independent memory module (MM), and two memory chips (MC1, MC2) that are accessed through a memory interface unit (MIU). In other words, the memory chips are functionally dependent on the MIU. The entire memory system fails when both the embedded memory block and EMB fail. The embedded

Verification using Markov method

To verify the correctness of the proposed method, we also perform the Markov analysis of the example system. Fig. 8 illustrates the compact state transition diagram in the Markov solution. Note that each node in the diagram is indicated by the corresponding state number and a set of events that can occur in that state. For example, in node/state 7, only event El can happen.

Solving the Markov model, we obtain the system unreliability as the probability of the system being in the failure state
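To make the idea of reading unreliability off a Markov model concrete, here is a minimal sketch on a toy model. It is not the state diagram of Fig. 8, which is not reproduced in this excerpt. The example assumes a hypothetical non-repairable system of two identical parallel components with a constant failure rate `LAM`, integrates the Kolmogorov forward equations dp/dt = pQ with a classical fourth-order Runge-Kutta scheme, and takes the probability of the absorbing failure state as the unreliability. All names and values are illustrative assumptions.

```python
import math


def solve_ctmc(q, p0, t_end, steps=2000):
    """Integrate dp/dt = p Q with classical RK4 and return the
    state-probability vector at time t_end."""
    n = len(p0)

    def deriv(p):
        # Row vector times generator matrix: (p Q)_j = sum_i p_i * q[i][j]
        return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]

    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[i] + 0.5 * h * k1[i] for i in range(n)])
        k3 = deriv([p[i] + 0.5 * h * k2[i] for i in range(n)])
        k4 = deriv([p[i] + h * k3[i] for i in range(n)])
        p = [p[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(n)]
    return p


# Hypothetical values, chosen only for illustration (not from the paper).
LAM = 1e-3   # per-component failure rate, per hour
T = 1000.0   # mission time, hours

# States: 0 = both components working, 1 = one working,
#         2 = system failed (absorbing state).
Q = [
    [-2 * LAM, 2 * LAM, 0.0],
    [0.0, -LAM, LAM],
    [0.0, 0.0, 0.0],
]

probs = solve_ctmc(Q, [1.0, 0.0, 0.0], T)
unreliability = probs[2]

# Closed form for two independent parallel components, as a cross-check.
expected = (1.0 - math.exp(-LAM * T)) ** 2

print(unreliability, expected)
```

A model of a real system with competing propagation and isolation events would need additional states to track the order in which the trigger and dependent components fail, which is why the state space grows quickly compared with a combinatorial approach.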

Generalization of the proposed method

As discussed in Section 2, the proposed method assumes that the non-dependent components, including the trigger component, suffer only local failures (LF). In this section, we generalize the proposed method to allow the non-dependent components to have LF, PFGE and PFSE.

Conclusions and future work

This paper has proposed a combinatorial method for the reliability analysis of non-repairable binary systems subject to competing failure isolation and failure propagation events with both global and selective effects. The proposed method is not limited to series, parallel or series-parallel structures. However, it is not directly applicable to traditional network analysis where k-terminal reliabilities are used as reliability metrics [17]. As illustrated through the example analysis, based on

Acknowledgments

This work was supported in part by the US National Science Foundation under grant number 0832594. The authors are thankful to the editor and anonymous reviewers for their valuable comments, which helped us improve the quality of the paper.
