Evaluation of process controller fault tolerance using simulation

https://doi.org/10.1016/S0928-4869(00)00009-4

Abstract

The paper presents a study of several alternative designs of a fault-tolerant process controller. We compare controller architectures based on different amounts of hardware redundancy with those using time redundancy. The system behaviour is evaluated by means of a process-oriented simulation model that enables software injection of faults. As an overall measure of controller design quality (which includes both performance and reliability) we use the numerical error of the output. The results obtained on the model show how the output error depends on the relative speed of computation and on the rate of faults damaging the data. Thus, for every set of parameters, the system configuration that gives the best results can be determined.

Introduction

Many digital systems are used in environments where a failure may lead to heavy material losses, endanger lives or compromise project integrity. For these purposes a special kind of system has been developed, called fault-tolerant (FT) systems. At the beginning, fault tolerance was a rare property reserved for special projects such as space research and military and security operations. With technological progress and the development of less costly design and implementation methods, fault tolerance has become an almost common feature, which can be found in practically all domains of information processing.

The concept of a fault-tolerant system was introduced in the early works of Avizienis [5], [6]. A system is fault tolerant if it performs all its functions correctly even in the presence of hardware faults and software errors (including erroneous data). More specifically, the system function is considered correct if the system does not stop or essentially alter its function, and if the data processing results are correct and delivered in time.

Several different mechanisms of fault tolerance have been developed so far, some of them in common use in the data processing environment. The unifying feature of all these mechanisms is the use of redundancy. The most frequently used forms are hardware, software, time and information redundancy. Each has its specific role in a fault-tolerant system, so that, depending on the application area, some subset of these forms can be found. It is characteristic of redundancy that several forms are practically always used at the same time, e.g. hardware redundancy accompanied by an extra delay in signal propagation (time redundancy).

The redundancy in the fault-tolerant system is used above all to guarantee a deterministic reaction to a fault. This reaction mostly consists of the following steps:

Fault detection. A process of recognising that a fault has occurred. It is mostly required before any further action can be taken.

Fault location. Process of determining where the fault has occurred.

Fault containment. Process of isolating the faulty area and preventing the effects of that fault from propagating throughout the system.

Fault recovery. Process of regaining the operational status via reconstruction of data and/or reconfiguration of hardware which was determined as faulty.

A very efficient mechanism of fault tolerance, which virtually combines all the above steps into one, is fault masking, which is based on the use of error-correcting codes. One of the most widely used codes is triplication with majority decoding. The properties of this mechanism are studied in this paper.
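Triplication with majority decoding can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from the paper; the function name `majority_decode` and the example bit patterns are hypothetical. Each output bit takes the value held by at least two of the three stored copies, so a fault confined to one copy is masked without being explicitly detected or located.

```python
def majority_decode(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies of the same data word.

    A bit of the result is 1 exactly when at least two of the
    three copies have a 1 in that position.
    """
    return (a & b) | (a & c) | (b & c)


# Hypothetical example: one copy suffers a single bit flip.
word = 0b10110010
corrupted = word ^ 0b00001000   # transient fault damages one copy
recovered = majority_decode(word, corrupted, word)
```

Because the decoding is bitwise, faults in two different copies are also masked as long as they do not hit the same bit position.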

Majority voting has been used many times and implemented in different ways, in hardware as well as in software [6], [16]. Although its properties have been studied and verified repeatedly, several features remain to be investigated, above all the influence of individual parameters on the resulting system properties, summarised as its dependability.

In an N-Modular Redundant (NMR) system, N computational modules execute identical tasks. They need to be synchronised periodically by voting on the current computational state and, if necessary, set to the same state as the majority of modules. If there is no majority state, some alternative procedure is chosen, such as recomputing the new state from the previous result. If only one majority element is used, its unreliability could become a bottleneck of the system; hence replication of the majority elements (distributed voting) is preferred in practical projects.
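The voting step described above might be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the paper's implementation: the names `nmr_vote` and `resynchronise` and the `fallback` callback (standing in for the "alternative procedure", e.g. recomputation from the previous result) are hypothetical, and module states are assumed to be directly comparable values.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def nmr_vote(states: Sequence[Hashable],
             fallback: Callable[[], Hashable]) -> Hashable:
    """Return the state held by a strict majority of modules.

    If no state is held by more than half of the modules, the
    hypothetical fallback procedure supplies the state instead.
    """
    state, votes = Counter(states).most_common(1)[0]
    if votes > len(states) // 2:
        return state
    return fallback()


def resynchronise(states: Sequence[Hashable],
                  fallback: Callable[[], Hashable]) -> List[Hashable]:
    """Set every module to the voted (or fallback) state."""
    agreed = nmr_vote(states, fallback)
    return [agreed] * len(states)
```

With three modules, `resynchronise([7, 7, 9], fallback)` sets all modules to 7, while three mutually different states trigger the fallback. The sketch uses a single voter; distributed voting would run one such vote per module.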

At present there are many application areas of fault-tolerant computing. Some fault-tolerant features, such as RAID disk arrays [12], have become standard equipment, and the question of whether the investment in redundancy is worthwhile is no longer raised. The computer architecture generally considered a typical environment for the application of fault tolerance is the multiprocessor system [14]. A very important field in which the demand for fault tolerance has appeared recently is the Internet [13], [20]. On the other hand, areas in which fault tolerance is required almost automatically include transportation [2], [9], [19], telecommunications [15], real-time control [18], nuclear power engineering [1] and, naturally, space research [7].

Verification of dependability is one of the most important steps in the design of a fault-tolerant system. This includes testing the system’s fault tolerance, i.e. its reaction to real-time disturbances from its environment and to faults inside and outside the system. Finding an appropriate verification method may save a considerable amount of time, expense and manpower, and the topic therefore receives ever-growing attention. Different verification methods were suggested e.g. in [3], [4], or [8]. They aim at evaluating reliability parameters with the aid of analytical or simulation models, predominantly at a high level of abstraction.

The method used here is based on digital simulation whose output is used for both qualitative and quantitative evaluation. It uses close-to-real code describing the FT controller function together with the computation dynamics and a submodel of the controller’s environment. Thus the controller behaviour (e.g. in the presence of transient faults) can be studied against the background of the controlled system’s behaviour. The paper presents some of the results obtained by the research groups at CTU Prague and UWB Pilsen. First the problem to be solved is formulated and put into the context of possible FT system application areas. The main features of the selected class of FT system structures and of the system’s operating mode are then described. Most attention is paid to the modelling of the system and its environment and to the simulation of system behaviour with and without faults. The experimental results are then presented, summarised and evaluated.

Controller function and architecture

Four controller configurations will be compared in the subsequent text: hardware redundancy with three modules (TMR – triple modular redundant), time redundancy with one repetition (two runs) of the control algorithm, time redundancy with two repetitions (three runs) of the control algorithm, and finally no redundancy (for the purpose of comparison).
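The difference between the time-redundant configurations and the non-redundant one can be sketched as below. This is an illustrative Python sketch, not the paper's controller code: `run_with_time_redundancy`, `make_faulty_step`, the stand-in control law `2 * x + 1` and the choice to return `None` when the runs disagree are all assumptions made for the example. The hardware TMR configuration would apply the same vote to three modules running in parallel rather than to repeated runs.

```python
from collections import Counter
from typing import Callable, Optional


def run_with_time_redundancy(step: Callable[[int], int],
                             x: int, runs: int) -> Optional[int]:
    """Execute the control step `runs` times and vote on the results.

    runs=1: no redundancy, the single result is accepted as is.
    runs=2: a disagreement is detected but cannot be masked.
    runs=3: a fault in one run is outvoted by the other two.
    Returns None when no strict majority exists (assumed behaviour).
    """
    results = [step(x) for _ in range(runs)]
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > runs // 2 else None


def make_faulty_step(fault_on_call: int) -> Callable[[int], int]:
    """Build a stand-in control step hit by one transient fault.

    The fault flips one bit of the result on the given call number
    (counting from 1); 0 means the step is never disturbed.
    """
    calls = {"n": 0}

    def step(x: int) -> int:
        calls["n"] += 1
        result = 2 * x + 1            # hypothetical control law
        if calls["n"] == fault_on_call:
            result ^= 0x4             # injected transient fault
        return result

    return step
```

With a fault injected into the second run, three runs still deliver the correct value, two runs only signal a disagreement, and a single faulty run passes the damaged value on unnoticed. Each additional run also multiplies the computation time of the working cycle.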

The controller is working in an infinite program loop whose single iteration will be denoted as a working cycle. We assume that the data used

Simulation model

The study of system properties presented here is based on discrete digital simulation. A static view (i.e. the simulation program text) of the object-oriented simulation model consists of a static set of classes describing types of modelled objects. Part of them contains a program of an object’s own activity, which runs concurrently with other objects’ activities. The run-time structure of the simulation model contains a dynamic set of pseudo-parallel processes (instances of a class of active

Modelled application

The controller structure and function described above were designed to be universal. The structure of the model and the modelling methodology used are to a great extent application independent (assuming the category of embedded control-loop applications). Thus we need only a simple application to demonstrate the possibilities of the presented simulation method.

The benchmark-like function of the controller we used is a qualified prediction of a discrete value of a one-dimensional signal varying in

Conclusions

A method for evaluating a fault-tolerant system using simulation was presented and demonstrated on the example of a process controller. From the point of view of modelling methodology, it should be pointed out that we model the function of a real-time program, which can be multithreaded or distributed. For the correctness of the results obtained, it is important that the model of the program is the program itself (we use a close-to-original C code) plus its dynamics (i.e., added duration of

Acknowledgements

This work has been in part supported by the grant contract 102/98/1463 of the Grant Agency of the Czech Republic and by the Ministry of Education, Youth and Sport Research Project “Information systems and Technologies” (IC: MSM 235200005).

References (21)

  • A. Rial et al., A general purpose fault-tolerant microcomputer system based upon single-board microcomputers. Microprocess. Microprogramm. (1983)
  • G. Allain-Morin, O. Pourret, Dependability assessment of a computerized nuclear protection system. In: Proceedings of...
  • A.M. Amendola, et al., Experimental evaluation of computer-based railway control systems. In: Proceedings of the...
  • J. Arlat, K. Kanoun, Modelling and dependability evaluation of safety systems in control and monitoring applications....
  • J. Arlat, Y. Crouzet, J.C. Laprie, Fault injection for dependability validation of fault-tolerant computing systems....
  • A. Avizienis, Architecture of fault-tolerant computing systems. In: Proceedings of the FTCS-5, 28 International...
  • A. Avizienis, The n-version approach to fault-tolerant software. IEEE Trans. Software Eng. (1985)
  • R. Gerlich et al., Formal methods for the validation of autonomous spacecraft fault tolerance. In: Proceedings of the...
  • K.K. Goswami, K.I. Ravinshankar, Simulation of software behavior under hardware faults. In: Proceedings of the FTCS-23...
  • G. Heiner, T. Thurner, Time-triggered architecture for safety-related distributed real-time systems in transportation...
There are more references available in the full text version of this article.
