Evaluation of process controller fault tolerance using simulation
Introduction
Many digital systems are used in environments where a failure may lead to heavy material losses, endanger lives or compromise the integrity of a project. A special kind of system, called a fault-tolerant (FT) system, has been developed for these purposes. Initially, fault tolerance was a rare property reserved for special projects such as space research or military and security operations. With technological progress and the development of less costly design and implementation methods, fault tolerance has become an almost common feature, which can be found in practically all domains of information processing.
The concept of a fault-tolerant system was introduced in the early works of Avizienis [5], [6]. A system is fault tolerant if it performs all its functions correctly even in the presence of hardware faults and software errors (including erroneous data). More specifically, the system function is considered correct if the system does not stop or substantially alter its function, and the data processing results are correct and delivered in time.
Several different mechanisms of fault tolerance have been developed so far, some of them being in common use in the data processing environment. The unifying feature of all these mechanisms is the use of redundancy. The most frequently used forms of redundancy are hardware, software, time and information redundancy. Each of them has its specific role in a fault-tolerant system, so that some subset of these forms can be found depending on the application area. Characteristically, several forms of redundancy are almost always used at the same time; for example, hardware redundancy is accompanied by extra signal propagation delay (time redundancy).
The redundancy in a fault-tolerant system is used primarily to guarantee a deterministic reaction to a fault. This reaction usually consists of the following steps:
Fault detection. The process of recognising that a fault has occurred. Detection is usually required before any further action can be taken.
Fault location. The process of determining where the fault has occurred.
Fault containment. The process of isolating the faulty area and preventing the effects of the fault from propagating throughout the system.
Fault recovery. The process of regaining operational status by reconstructing data and/or reconfiguring the hardware that was determined to be faulty.
A very efficient fault tolerance mechanism, which virtually combines all the above steps into one, is fault masking based on error-correcting codes. One of the most widely used codes is triplication with majority decoding. The properties of this mechanism are studied in this paper.
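Triplication with majority decoding can be illustrated with a minimal sketch. It assumes that a data word is stored as three identical copies and decoded by a bitwise 2-out-of-3 majority; the function names `encode` and `majority_decode` are hypothetical and are not taken from the controller code discussed in this paper.

```python
from typing import Tuple


def encode(word: int) -> Tuple[int, int, int]:
    """Triplicate a data word: the code word is three identical copies."""
    return (word, word, word)


def majority_decode(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority over three copies of a word.

    Each output bit takes the value held by at least two of the copies,
    so a fault is masked as long as no bit position is corrupted in
    more than one copy.
    """
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    a, b, c = encode(0b10110010)
    b ^= 0b00001000  # inject a single-bit fault into one copy
    print(bin(majority_decode(a, b, c)))
```

In this sketch the fault is masked without any explicit detection, location or recovery step, which is the sense in which masking combines these steps into one.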
Majority voting has been used many times and implemented in different ways, be it in hardware or in software [6], [16]. Although its properties have been studied and verified many times, there are still several features to be investigated, above all the influence of different parameters on the resulting system properties summarised as its dependability.
In an N-Modular Redundant (NMR) system, N computational modules execute identical tasks. They need to be synchronised periodically by voting on the current computational state and, if necessary, set to the same state as the majority of modules. If there is no majority state, some alternative procedure is chosen, such as recomputing the new state from the previous result. If only one majority element is used, its unreliability could turn out to be the bottleneck of the system; hence replication of the majority elements (distributed voting) is preferred in practical projects.
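The periodic synchronisation described above can be sketched as follows. This is an illustration under stated assumptions, not the voting scheme of any particular system: module states are assumed to be hashable values, a strict majority (more than N/2 modules) is required, and the fallback is passed in as a hypothetical `recompute` callback. The names `vote` and `synchronise` are likewise illustrative.

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional


def vote(states: List[Hashable]) -> Optional[Hashable]:
    """Return the state held by a strict majority of modules, else None.

    Assumption of this sketch: None is never a legitimate module state.
    """
    if not states:
        raise ValueError("at least one module state is required")
    state, count = Counter(states).most_common(1)[0]
    return state if count > len(states) // 2 else None


def synchronise(states: List[Hashable],
                recompute: Callable[[], Hashable]) -> List[Hashable]:
    """Set all modules to the majority state, or to a recomputed state."""
    agreed = vote(states)
    if agreed is None:
        # No majority: fall back to the alternative procedure.
        agreed = recompute()
    return [agreed] * len(states)


if __name__ == "__main__":
    print(synchronise([7, 9, 7], recompute=lambda: 0))  # one module deviates
    print(synchronise([1, 2, 3], recompute=lambda: 0))  # no majority state
```

Note that `vote` here is a single majority element; with distributed voting, each module would run its own copy of the voter on the exchanged states.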
Presently there are many application areas of fault-tolerant computing. Some fault-tolerant features, such as RAID disk arrays [12], are standard equipment, and whether the investment in redundancy is worthwhile is no longer questioned. The computer architecture generally considered to be a typical environment for the application of fault tolerance is the multiprocessor system [14]. A very important field where the demand for fault tolerance has appeared recently is the Internet [13], [20]. On the other hand, areas where fault tolerance is required almost automatically include transportation [2], [9], [19], telecommunications [15], real-time control [18], nuclear power engineering [1] and, naturally, space research [7].
Verification of dependability is one of the most important steps in the design of a fault-tolerant system. This includes testing the system’s fault tolerance, i.e. its reaction to real-time disturbances from its environment and to faults inside and outside the system. Finding an appropriate verification method may save a considerable amount of time, expense and manpower, and verification therefore receives ever-growing attention. Different verification methods were suggested e.g. in [3], [4], or [8]. They aim at evaluating reliability parameters with the aid of analytical or simulation models, predominantly at a high level of abstraction.
The method used here is based on digital simulation whose output is used for both qualitative and quantitative evaluation. It uses close-to-real code describing the FT controller function together with the computation dynamics and a submodel of the controller’s environment. Thus the controller behaviour (e.g. in the presence of transient faults) can be studied against the background of the controlled system’s behaviour. The paper presents some of the results obtained by the research groups at the CTU Prague and UWB Pilsen. First, the problem to be solved is formulated and put into the context of possible FT system application areas. The main features of the selected class of FT system structures and of the system’s operating mode are then described. Most attention is paid to the modelling of the system and its environment, and to the simulation of system behaviour with and without faults. The experimental results are then presented, summarised and evaluated.
Section snippets
Controller function and architecture
Four controller configurations will be compared in the subsequent text: hardware redundancy with three modules (TMR – triple modular redundant), time redundancy with one repetition (two runs) of the control algorithm, time redundancy with two repetitions (three runs) of the control algorithm, and finally no redundancy (for the purpose of comparison).
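The difference between the two time-redundant configurations can be sketched as follows. This is a minimal illustration, not the controller code evaluated in the paper: it assumes that two runs can only detect a disagreement, whereas three runs allow a single deviating result to be outvoted. The names `run_twice`, `run_three_times` and the helper `scripted_step`, which stands in for a control algorithm hit by a transient fault, are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

# Each function returns (result, fault_seen).
Outcome = Tuple[Optional[int], bool]


def run_twice(step: Callable[[], int]) -> Outcome:
    """One repetition (two runs): a mismatch is detected but not masked."""
    first, second = step(), step()
    if first == second:
        return first, False
    return None, True  # the runs disagree, neither result can be trusted


def run_three_times(step: Callable[[], int]) -> Outcome:
    """Two repetitions (three runs): a single deviating run is outvoted."""
    results = [step() for _ in range(3)]
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value, count < 3  # majority found; flag a masked fault
    return None, True  # all three runs differ


def scripted_step(outputs: Iterable[int]) -> Callable[[], int]:
    """Hypothetical stand-in for the control algorithm: replays fixed outputs."""
    it = iter(outputs)
    return lambda: next(it)


if __name__ == "__main__":
    # A transient fault corrupts the second run only.
    print(run_twice(scripted_step([5, 6])))
    print(run_three_times(scripted_step([5, 6, 5])))
```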
The controller is working in an infinite program loop whose single iteration will be denoted as a working cycle. We assume that the data used
Simulation model
The study of system properties presented here is based on discrete digital simulation. A static view (i.e. the simulation program text) of the object-oriented simulation model consists of a static set of classes describing types of modelled objects. Part of them contains a program of an object’s own activity, which runs concurrently with other objects’ activities. The run-time structure of the simulation model contains a dynamic set of pseudo-parallel processes (instances of a class of active
Modelled application
The structure and function of the controller described above were designed to be universal. The structure of the model and the modelling methodology used are to a great extent application independent (assuming the category of embedded control-loop applications). Thus only a simple application is needed to demonstrate the possibilities of the presented simulation method.
The benchmark-like function of the controller we used is a qualified prediction of a discrete value of a one-dimensional signal varying in
Conclusions
A method for evaluating a fault-tolerant system using simulation was presented and demonstrated on the example of a process controller. From the point of view of modelling methodology, it should be pointed out that we model the function of a real-time program, which can be multithreaded or distributed. For the correctness of the results obtained, it is important that the model of the program is the program itself (we use close-to-original C code) plus its dynamics (i.e., added duration of
Acknowledgements
This work was supported in part by grant contract 102/98/1463 of the Grant Agency of the Czech Republic and by the Ministry of Education, Youth and Sport Research Project “Information systems and Technologies” (IC: MSM 235200005).
References
- et al., A general purpose fault-tolerant microcomputer system based upon single-board microcomputers. Microprocess. Microprogramm. (1983)
- G. Allain-Morin, O. Pourret, Dependability assessment of a computerized nuclear protection system. In: Proceedings of...
- A.M. Amendola, et al., Experimental evaluation of computer-based railway control systems. In: Proceedings of the...
- J. Arlat, K. Kanoun, Modelling and dependability evaluation of safety systems in control and monitoring applications....
- J. Arlat, Y. Crouzet, J.C. Laprie, Fault injection for dependability validation of fault-tolerant computing systems....
- A. Avizienis, Architecture of fault-tolerant computing systems. In: Proceedings of the FTCS-5, 28 International...
- The n-version approach to fault-tolerant software. IEEE Trans. Software Eng. (1985)
- R. Gerlich et al., Formal methods for the validation of autonomous spacecraft fault tolerance. In: Proceedings of the...
- K.K. Goswami, K.I. Ravinshankar, Simulation of software behavior under hardware faults. In: Proceedings of the FTCS-23...
- G. Heiner, T. Thurner, Time-triggered architecture for safety-related distributed real-time systems in transportation...