Optimal structure of multi-state systems with multi-fault coverage
Introduction
Fault tolerance is widely used to enhance system reliability, especially for systems with stringent reliability requirements, such as nuclear power controllers and flight control systems [12], [8], [22], [20]. However, because the fault and error handling mechanisms (detection, location, and isolation) can themselves fail, some failures remain undetected or uncovered, which can lead to total failure of the entire system or of its subsystems [17], [26], [13]. Examples of this effect of uncovered faults can be found in computing systems, electrical power distribution networks, phased-mission systems, etc. [5], [27], [24].
The probability of successfully covering a fault (avoiding fault propagation), given that the fault has occurred, is known as the coverage factor [4], [1], [2]. Because different fault covering mechanisms exist, different coverage models have been studied in the literature [25], [18], [19]. Among these, the element level coverage (ELC) model and the fault level coverage (FLC) model are the most important and the most widely studied. Under ELC, the coverage probability of each system component is independent of the status of the other components. ELC is typical of systems with a built-in test (BIT) capability, where the selection among redundant elements is based on the self-diagnostic capability of the individual elements. Under FLC, the coverage probability of a system element depends on the number of failed elements; in other words, the selection among redundant elements varies between initial and subsequent failures. In the HARP terminology [3], ELC models are known as single-fault models, whereas FLC models are known as multi-fault models. Multi-fault models can represent a wide range of fault tolerant mechanisms. An example is majority voting among the elements currently known to be working; see Myers and Rauzy [18].
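To make the distinction between the two coverage models concrete, the following minimal sketch computes the reliability of a single redundant group under both. It rests on simplifying assumptions that are not taken from the text: the group consists of n identical, independent binary elements with reliability p, it works as long as at least one element works, and any uncovered failure brings the whole group down. The function names and the list `coverage` (where `coverage[i]` is the assumed probability that the (i+1)-th failure is covered) are hypothetical.

```python
from math import comb, prod


def group_reliability(n, p, coverage):
    """Reliability of a 1-out-of-n group of identical elements (illustrative).

    Assumptions (not from the paper): elements are independent with
    reliability p, and coverage[i] is the probability that the (i+1)-th
    failure in the group is covered. An uncovered failure fails the group.
    """
    if len(coverage) < n - 1:
        raise ValueError("need a coverage value for each tolerated failure")
    q = 1.0 - p
    total = 0.0
    # k failed elements, k = 0..n-1, so at least one element still works.
    for k in range(n):
        all_covered = prod(coverage[:k])  # empty product is 1 when k == 0
        total += comb(n, k) * p ** (n - k) * q ** k * all_covered
    return total


def elc_coverage(n, c):
    """ELC as a special case: the same coverage c for every failure."""
    return [c] * n


if __name__ == "__main__":
    # ELC: every failure is covered with probability 0.95.
    print(group_reliability(3, 0.9, elc_coverage(3, 0.95)))
    # FLC: the first failure is covered better than the second one.
    print(group_reliability(3, 0.9, [0.99, 0.90]))
```

Under these assumptions ELC is simply the case in which every entry of `coverage` is the same, whereas FLC lets the entries change with the number of failures already present in the group.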
Because fault coverage is imperfect, system reliability can decrease as redundancy increases beyond a certain limit [11], [17]. This gives rise to system structure optimization problems. Some of these problems have been formulated and solved for parallel systems and k-out-of-n systems [1], [2]. Levitin [10] presents a model of series-parallel multi-state systems (MSS) with two types of task parallelization: parallel task execution with work sharing, and redundant task execution. A framework for finding the balance between the two kinds of parallelization that maximizes system reliability is proposed under the assumption that ELC applies in each work sharing group. Given the variety of fault handling mechanisms used in practice, the ELC model alone cannot cover all cases. Although Levitin and Amari [11] proposed a way to evaluate the reliability of MSS under FLC, the system structure optimization problem was not studied. In Levitin [10], imperfect coverage is incorporated by using a special term in the universal generating function (UGF) of each element to denote an uncovered failure. After the probability that the system fails due to an uncovered element failure is calculated, the problem is reduced to the case where no uncovered failure exists. Incorporating FLC into the system structure optimization framework is much more complicated than incorporating ELC, especially in terms of coding and programming, because not only the system performance but also the number of failed elements in each work sharing group needs to be tracked. In addition, considering FLC makes it possible to analyze the optimal system structure for different trends of the fault coverage factor as the number of failed elements changes. To provide a useful reference for practitioners, this paper extends the problem of finding the optimal balance between the two kinds of parallelization to the case of FLC.
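The observation that reliability can fall once redundancy passes some limit can be illustrated with a small sketch. It assumes, purely for illustration, a parallel group of n identical binary elements under ELC: each element works with probability p, fails covered with probability (1 - p)c, and fails uncovered with probability (1 - p)(1 - c); the group works if there is no uncovered failure and at least one element works. The function names are hypothetical and the closed form below follows from these assumptions only, not from the paper's multi-state model.

```python
def elc_parallel_reliability(n, p, c):
    """Reliability of n identical parallel elements under ELC (illustrative).

    P(no uncovered failure) is (p + q*c)**n; subtracting (q*c)**n removes
    the case in which every element has failed (all failures covered).
    """
    q = 1.0 - p
    return (p + q * c) ** n - (q * c) ** n


def best_redundancy(p, c, max_n):
    """Return the n in 1..max_n that maximizes the group reliability."""
    return max(range(1, max_n + 1),
               key=lambda n: elc_parallel_reliability(n, p, c))


if __name__ == "__main__":
    p, c = 0.9, 0.9  # assumed element reliability and coverage factor
    for n in range(1, 7):
        print(n, elc_parallel_reliability(n, p, c))
    print("best n:", best_redundancy(p, c, 6))
```

With these assumed values the reliability rises from n = 1 to n = 2 and then declines, because each added element also adds a chance of an uncovered failure.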
Section 2 presents the model. Section 3 describes the UGF-based algorithm for evaluating the reliability of series-parallel MSS with FLC. Section 4 discusses the optimization procedure based on the genetic algorithm technique. Numerical examples in Section 5 illustrate the application of the framework in different situations.
Section snippets
Model description and problem formulation
Consider a system consisting of M subsystems connected in series. Each subsystem m contains Em different elements connected in parallel. In each subsystem, the elements can be separated into several work sharing groups (WSG). In each WSG, the available elements share their work in an optimal way that maximizes the performance of the entire group. When an element in a WSG fails, the resource management system is able to redistribute the task among the available elements if the failure
Incorporating uncovered failures into the UGF technique
The pmf of a discrete random variable X can be represented by a UGF as [23], [16], [6], [15]

$$u_X(z) = \sum_{h=0}^{H} \varepsilon_h z^{x_h},$$

where the variable X has H+1 possible values and εh=Pr {X=xh}. The UGF representing the pmf of a function of two independent random variables φ(X,Y) can be obtained using the following composition operator as
Since the fault coverage probability of a system element depends on the number of failed elements in the
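The UGF representation and its composition operator can be sketched in a few lines of code. The sketch below stores a UGF as a dictionary mapping each value of the random variable to its probability, and composes two independent variables by applying φ to every pair of values and multiplying the corresponding probabilities. The names `compose` and `reliability`, the example performance values, and the choice of φ (sum for capacity sharing in parallel, min for a series connection) are assumptions made for illustration; the tracking of failed elements needed for FLC is not modeled here.

```python
from collections import defaultdict


def compose(u_x, u_y, phi):
    """Compose two UGFs (dicts value -> probability) of independent variables.

    Each pair of terms contributes probability px * py to the value
    phi(x, y); terms that land on the same value are collected.
    """
    result = defaultdict(float)
    for x, px in u_x.items():
        for y, py in u_y.items():
            result[phi(x, y)] += px * py
    return dict(result)


def reliability(u, demand):
    """Probability that the performance meets or exceeds the demand."""
    return sum(prob for value, prob in u.items() if value >= demand)


if __name__ == "__main__":
    # Two hypothetical elements: performance 0 (failed) or nominal capacity.
    u1 = {0: 0.1, 5: 0.9}
    u2 = {0: 0.2, 3: 0.8}

    shared = compose(u1, u2, lambda a, b: a + b)  # assumed work sharing
    series = compose(u1, u2, min)                 # assumed series link

    print(shared)
    print(series)
    print(reliability(shared, 5))
```

Collecting like terms in a dictionary keeps the number of terms bounded by the number of distinct performance values rather than by the number of state combinations.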
Optimization technique
Eq. (5) formulates a complicated set partitioning problem. An exhaustive examination of all possible solutions is not realistic within reasonable time limits. The genetic algorithm (GA) has proven to be an effective optimization tool for a large number of complicated problems in reliability engineering [7], [28], [14], [21]. To apply the GA to a specific problem, the solution representation and the decoding procedure must be defined.
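One possible solution representation for a set partitioning problem of this kind is an integer string in which each gene gives the group label of one element. The sketch below decodes such a string into a list of groups. This encoding is an assumption chosen for illustration; the representation and decoding procedure actually used in the paper are not shown in this excerpt, and the names `decode`, `chromosome`, and the element labels are hypothetical.

```python
def decode(chromosome, elements):
    """Decode an integer string into a partition of the elements.

    chromosome[j] is the (assumed) group label of elements[j]; elements
    with equal labels fall into the same group. Unused labels are skipped,
    so different strings may decode to the same partition.
    """
    if len(chromosome) != len(elements):
        raise ValueError("one gene is required per element")
    groups = {}
    for element, gene in zip(elements, chromosome):
        groups.setdefault(gene, []).append(element)
    # Order groups by label so that decoding is deterministic.
    return [groups[label] for label in sorted(groups)]


if __name__ == "__main__":
    elements = ["e1", "e2", "e3", "e4"]
    print(decode([2, 0, 2, 1], elements))
    print(decode([0, 0, 0, 0], elements))  # all elements in one group
```

A string of identical genes places all elements in one group, while a string of distinct genes places each element in its own group, so the same encoding spans both extremes of the partition space.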
Illustrative examples
In order to illustrate the applications of the proposed framework in different situations, this section considers a data transmission system and a task processing system. Different assumptions of fault coverage values are discussed.
Conclusion
This paper extends the problem of finding the optimal balance between redundancy and task sharing in multi-state systems with uncovered failures to the case of multi-fault coverage. It is assumed that uncovered failures of elements belonging to the same work sharing group can cause failure of the entire group. Due to different fault covering mechanisms, the probability of such failure can be proportional to either the number of working elements or failed elements in the group when the
Acknowledgment
This work was supported in part by China NSFC under Grant 71231001, the China Postdoctoral Science Foundation funded project under Grant 2013M530531, and the Fundamental Research Funds for the Central Universities under Grants FRF-MP-13-009A and FRF-TP-13-026A. This research is also partially supported by a grant from City University of Hong Kong (Project No. 9380058).
References (28)
- et al. Multi-state systems with selective propagated failures and imperfect individual and group protections. Reliability Engineering & System Safety (2011)
- et al. A multi-state model for the reliability assessment of a distributed generation system via universal generating function. Reliability Engineering & System Safety (2012)
- et al. Redundancy analysis for repairable multi-state system by using combined stochastic processes methods and universal generating function technique. Reliability Engineering & System Safety (2009)
- et al. Assessment of redundant systems with imperfect coverage by means of binary decision diagrams. Reliability Engineering & System Safety (2008)
- et al. Defending simple series and parallel systems with imperfect false targets. Reliability Engineering & System Safety (2010)
- et al. Competing failure analysis in phased-mission systems with functional dependence in one of phases. Reliability Engineering & System Safety (2012)
- et al. Combinatorial analysis of systems with competing failures subject to failure isolation and propagation effects. Reliability Engineering & System Safety (2010)
- et al. Reliability of k-out-of-n systems with phased-mission requirements and imperfect fault coverage. Reliability Engineering & System Safety (2012)
- et al. Some improvements on adaptive genetic algorithms for reliability-related applications. Reliability Engineering & System Safety (2010)
- Amari, S., 1997. Reliability, risk and fault-tolerance of complex systems. PhD thesis. Indian Institute of Technology,...
- Optimal design of k-out-of-n: G subsystems subjected to imperfect fault-coverage. IEEE Transactions on Reliability
- OBDD-based evaluation of reliability and importance measures for multistate systems subject to imperfect fault coverage. IEEE Transactions on Dependable and Secure Computing