Managing redundancy in CAN-based networks supporting N-Version Programming

https://doi.org/10.1016/j.csi.2007.11.007

Abstract

Software is a major source of reliability degradation in dependable systems. One of the classical remedies is to provide software fault tolerance by using N-Version Programming (NVP). However, because NVP solutions require non-standard hardware as well as changes and additions at all levels of the system, they are costly and have only been used in special cases.

In a previous work, a low-cost architecture for NVP execution was developed. The key features of this architecture are the use of off-the-shelf components, including communication standards, and the relocation of the fault tolerance functionality (voting, error detection, fault masking, consistency management, and recovery) into separate redundancy management circuitry, one unit for each redundant computing node.

In this article we present an improved design of that architecture, specifically resolving some potential inconsistencies that were not treated in detail in the original design. In particular, we present novel techniques for enforcing replica determinism.

Our improved architecture is based on the Controller Area Network (CAN). This choice goes beyond the obvious cost benefit of using standards: all the rest of the architecture is designed to take full advantage of the CAN standard's features, such as data consistency, in order to significantly increase the efficiency and reduce the complexity and cost of the resultant system.

Although initially developed for NVP, our redundancy management circuitry also supports other software replication techniques, such as active replication.

Introduction

Software faults are widely accepted as one of the most important sources of unreliability in computer systems. Their effects can be so negative that there is renewed interest in evaluating software risks [1]. Therefore, the design of systems which provide tolerance to software faults for critical applications is an important topic. A series of recent related projects [2], [3], [4], called DISCS, DISPO, DISPO-2 and DOTS, has addressed the issue of evaluating the dependability provided by techniques for tolerating software faults. Nevertheless, the design of complete systems which are able to tolerate software faults is, perhaps, not receiving as much attention nowadays as the importance of software faults would suggest. One of the reasons is the high cost of such systems.

Commercially available fault-tolerant systems often use application-specific multiprocessor architectures [5], [6], [7]. Their design and manufacturing costs have a strong impact on the final price and discourage potential buyers. When tolerance to software faults is also required, cost is further increased by the development of redundant – and diverse – application software.

According to [8], the reason fault-tolerant systems are not as widely used as the interest in the subject would suggest is that the fault tolerance mechanisms are not orthogonal to the other functionalities of the system, in the sense that essentially all system components must be adapted to handle the fault tolerance. This makes it difficult to use low-cost commercial components based on standards in the design of a fault-tolerant system, and development of the fault tolerance aspects may even impede development of other aspects. As a final result, the cost of a fault-tolerant system is much higher than the cost of a non-fault-tolerant system with equivalent performance, even if the cost of redundancy is not taken into account.

In an attempt to solve this problem, Miro-Julia [9] proposed a low-cost architecture for the execution of applications which tolerate software faults following the N-Version Programming (NVP) paradigm [10]. In NVP, N diverse versions of the same program are developed by independent teams. Each version is partitioned into a set of segments, and corresponding segments in different versions are intended to perform the same function. In execution, each time a version finishes a segment, it issues a vector of results of this segment, called a cc-vector (see Fig. 1). A decision algorithm is then executed on the cc-vectors of all versions to obtain a consensus cc-vector, which is sent back to all versions to be used in the continued computation. This mechanism, called a cc-point, provides both synchronization among versions and masking of faults in a minority of versions.
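To make the cc-point mechanism concrete, the following is a minimal sketch of one possible decision algorithm: a bytewise majority vote over the N cc-vectors. The vector length, the type names and the choice of N = 3 are our own illustrative assumptions, not details taken from the paper.

```c
/* Minimal sketch of a cc-point decision algorithm: a bytewise
 * majority vote over the cc-vectors issued by the N versions.
 * N_VERSIONS and CC_VECTOR_LEN are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define N_VERSIONS 3          /* assumed N = 3 */
#define CC_VECTOR_LEN 16      /* assumed cc-vector size in bytes */

typedef struct {
    uint8_t data[CC_VECTOR_LEN];
} cc_vector_t;

/* Returns 0 and fills *consensus if some cc-vector is matched
 * (bytewise) by a strict majority of versions; -1 otherwise. */
int decide(const cc_vector_t v[N_VERSIONS], cc_vector_t *consensus)
{
    for (int i = 0; i < N_VERSIONS; i++) {
        int votes = 0;
        for (int j = 0; j < N_VERSIONS; j++)
            if (memcmp(v[i].data, v[j].data, CC_VECTOR_LEN) == 0)
                votes++;
        if (2 * votes > N_VERSIONS) {   /* strict majority found */
            *consensus = v[i];
            return 0;
        }
    }
    return -1;   /* no majority: the cc-point fails */
}
```

Bytewise comparison is a deliberately strict notion of agreement; it sidesteps questions of how "equal enough" two results are, which matters later for replica determinism.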

In the low-cost architecture proposed in [9] (see Fig. 2) the N versions are executed, each one on a different computer. Moreover, each time a version finishes a segment, the corresponding cc-vector is immediately broadcast over a broadcast network by means of a special hardware unit, called the N-Version Executive Processor (NVXP). One NVXP is attached to each computer to manage this communication, as well as other key functions for fault tolerance support, such as executing the decision algorithm on the cc-vectors of all versions and returning the consensus cc-vector to the local version. In this architecture the transmission of cc-vectors among NVXPs is triggered each time a version finishes a segment; this corresponds to an event-triggered communication scheme that adapts to the diverse execution times of the versions.
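The event-triggered exchange at a cc-point can be sketched as follows, as seen from one node's NVXP. The primitives bcast_cc_vector() and recv_cc_vector() are hypothetical placeholders for the broadcast network interface, and decide() is the majority voter sketched above; this is our reading of the scheme, not code from [9].

```c
/* Sketch of one cc-point as seen from a single node's NVXP. */
#include <stdint.h>

#define N_VERSIONS 3                     /* assumed N = 3 */
#define CC_VECTOR_LEN 16                 /* assumed size in bytes */
typedef struct { uint8_t data[CC_VECTOR_LEN]; } cc_vector_t;

/* Hypothetical broadcast-network primitives. */
extern void bcast_cc_vector(const cc_vector_t *v);     /* send own   */
extern void recv_cc_vector(int from, cc_vector_t *v);  /* blocking   */
extern int  decide(const cc_vector_t v[], cc_vector_t *consensus);

/* Invoked each time the local version finishes a segment. */
int cc_point(int self, const cc_vector_t *own, cc_vector_t *consensus)
{
    cc_vector_t all[N_VERSIONS];

    bcast_cc_vector(own);                /* event-triggered: send now */
    all[self] = *own;
    for (int i = 0; i < N_VERSIONS; i++)
        if (i != self)
            recv_cc_vector(i, &all[i]);  /* wait for the other versions */

    /* Every NVXP votes locally; consistency requires that all of
     * them operate on the same set of cc-vectors. */
    return decide(all, consensus);
}
```

Note that the vote is replicated on every node; the inconsistency scenarios discussed next arise precisely when these replicated votes do not see identical inputs.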

This architecture has two notable properties: first, the mechanisms for fault tolerance are concentrated in the added NVXP and are thereby orthogonal to the other functionalities of the system; second, hardware Commercial Off-The-Shelf (COTS) components based on standards are used as main computers and as building blocks for the implementation of the NVXP. However, the specific design proposed in [9] for this architecture is difficult to implement in practice and presents some scenarios of inconsistency; e.g., the votes performed by different NVXPs may yield different results, which may cause even non-faulty versions to diverge, since they would use different inputs in the subsequent executions.

In this paper we present a new architecture that we have devised in order to eliminate the scenarios of inconsistency mentioned above. Our architecture takes the one introduced in [9] as its starting point. We have made the following three main changes to the hardware of this architecture. First, we have developed a completely new and improved design for the NVXPs, which we call Redundancy and Communication Management Boards (RCMBs). Second, in order to satisfy our low-cost requirement, we have chosen standard PCs as platforms, i.e., the RCMBs are PC boards inserted in the bus of the host PCs (where the versions are executing). And third, we use the Controller Area Network (CAN) protocol [11] as the basic technology for the broadcast network, due to its well-known advantages in cost, reliability and real-time performance, and due to the growing interest in using CAN for critical applications. With the choice of CAN we go beyond the obvious cost benefit of using standards: all the rest of the architecture is designed to take full advantage of the CAN standard's features in order to significantly increase the efficiency and reduce the complexity and cost of the resultant system.
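As an illustration of how a cc-vector can travel over CAN, the following sketch broadcasts one using Linux SocketCAN. This is purely an assumption for concreteness: the RCMBs described in this paper are dedicated PC boards, not Linux hosts. Classical CAN frames carry at most 8 data bytes, so a larger cc-vector must be fragmented, and the frame identifier both distinguishes the fragments and fixes the bus arbitration priority; the identifier allocation in base_id is hypothetical.

```c
/* Sketch: broadcasting a cc-vector over CAN via Linux SocketCAN. */
#include <linux/can.h>
#include <linux/can/raw.h>
#include <net/if.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a raw CAN socket bound to the given interface, e.g. "can0". */
int open_can(const char *ifname)
{
    struct ifreq ifr;
    struct sockaddr_can addr = {0};
    int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
    if (s < 0) return -1;

    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_name[IFNAMSIZ - 1] = '\0';
    if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) { close(s); return -1; }

    addr.can_family  = AF_CAN;
    addr.can_ifindex = ifr.ifr_ifindex;
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Broadcast a cc-vector as consecutive 8-byte fragments. The
 * base_id allocation (sender + fragment index) is hypothetical. */
int send_cc_vector(int s, canid_t base_id, const uint8_t *v, size_t len)
{
    for (size_t off = 0; off < len; off += 8) {
        struct can_frame f = {0};
        f.can_id  = base_id + off / 8;
        f.can_dlc = (len - off < 8) ? (uint8_t)(len - off) : 8;
        memcpy(f.data, v + off, f.can_dlc);
        if (write(s, &f, sizeof f) != sizeof f)
            return -1;   /* the CAN controller itself also retries
                            transmission after detected bus errors */
    }
    return 0;
}
```

CAN's built-in error signalling and automatic retransmission are among the features the architecture exploits so that all nodes tend to receive the same set of frames.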

Note that for the rest of this paper we shall consider that a node of this architecture is any ensemble constituted by a PC and the RCMB that is directly attached to it.

Besides introducing changes in the hardware, we have designed new software to be executed in the RCMBs. This software is responsible for the consistent management of the redundancy in this architecture. Two issues related to this consistent management constitute the main focus of this research: replica determinism enforcement [12] of all replicated operations and consistent reintegration of RCMBs after transient faults. Replica determinism enforcement ensures, for instance, that all non-faulty replicas of the voting procedure executed by the different RCMBs produce the same consensus cc-vector. Reintegration allows an RCMB that has been disconnected, in order to prevent error propagation caused by transient faults, to be integrated in the system again. The purpose of reintegration is to make sure that the redundancy of the system does not permanently degrade as RCMBs are disconnected due to transient faults. It should be noted that mechanisms for reintegration of versions and PCs that have suffered a transient fault are provided by NVP itself. Indeed, given that at the cc-points described above all versions receive the resulting consensus cc-vector, not only is fault masking achieved, but versions that have issued a wrong cc-vector – e.g., because of a transient fault in the corresponding computer – also have an opportunity to recover by resuming computation from the consensus cc-vector values. In contrast, the reintegration of RCMBs affected by transient faults is a new problem that has to be solved by our architecture. Due to space limitations, how we have achieved this reintegration is not described in this paper; a thorough description can be found in [13], where the implementation of our entire architecture is also described.
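The version-level recovery provided by NVP itself can be illustrated with a short sketch. All names here are illustrative rather than taken from the paper: the point is only that every version, including one whose cc-vector was outvoted because of a transient fault, resumes computation from the consensus values.

```c
/* Sketch of version-side recovery at a cc-point under NVP. */
#include <stdint.h>

#define CC_VECTOR_LEN 16                 /* assumed size in bytes */
typedef struct { uint8_t data[CC_VECTOR_LEN]; } cc_vector_t;

/* Hypothetical hooks: segment computation and the cc-point exchange. */
extern cc_vector_t compute_segment(int segment, const cc_vector_t *in);
extern int cc_point(int self, const cc_vector_t *own, cc_vector_t *out);

int run_segment(int self, int segment, cc_vector_t *state)
{
    cc_vector_t own = compute_segment(segment, state);
    cc_vector_t consensus;

    if (cc_point(self, &own, &consensus) != 0)
        return -1;               /* no majority: signal an error */

    /* Continue from the consensus even if our own cc-vector was
     * outvoted: this masks a faulty minority AND lets an outvoted
     * version recover from a transient fault. */
    *state = consensus;
    return 0;
}
```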

Besides low cost, real implementability and consistency, our design has another important requirement: the management of redundancy must be achieved without introducing significant computation or communication overhead in the system. Ultimately, NVP performance strongly depends on the time spent waiting for the voting results.

Section snippets

Basic features of our architecture

The features that serve as the basis for our replica determinism enforcement and reintegration mechanisms are related to the definition of the error containment boundaries and to the organization of the fault tolerance operations.

Replica determinism enforcement

As indicated in Section 1, the aspect of consistency that this paper focuses on is the replica determinism enforcement [12] of all components that have been replicated for fault tolerance purposes.

Roughly speaking, we can say that a group of replicas of the same operation exhibits replica determinism [12] when all non-faulty replicas show correspondence of replica outputs and/or state changes. This definition must be complemented by a correspondence requirement which indicates the meaning of
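As a simple illustration of a replica non-determinism source, consider replicas that branch on a local timeout: two non-faulty replicas whose clocks differ slightly may take different branches near the deadline, violating the correspondence above. Enforcement replaces such local decisions with agreed values. The scenario below is our own example, not one taken from [12].

```c
/* Sketch contrasting a non-deterministic and a deterministic
 * replica decision; names and scenario are illustrative. */
#include <stdbool.h>
#include <time.h>

/* NON-deterministic: each replica consults its own clock, so two
 * non-faulty replicas near the deadline may take different branches. */
bool deadline_passed_local(time_t deadline)
{
    return time(NULL) > deadline;    /* local clocks differ slightly */
}

/* Deterministic: every replica decides on the SAME agreed timestamp,
 * e.g. one distributed to all replicas beforehand, so all non-faulty
 * replicas take the same branch. */
bool deadline_passed_agreed(time_t agreed_now, time_t deadline)
{
    return agreed_now > deadline;
}
```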

Related work

In this section we identify some similarities and differences between our system and other architectures. We shall focus on Delta-4 [19] and GUARDS [23], since all the concepts and techniques related to consistent management of redundancy were already mature at the time these architectures were developed. However, neither Delta-4 nor GUARDS presents specific mechanisms for the execution of NVP applications, though both pay special attention to the replica non-determinism problem. Moreover, GUARDS is

Conclusions

We have described a new architecture for an embedded distributed system that is tolerant to software faults through the execution of NVP applications. Our architecture is based on a previous design which is aimed at providing a low-cost infrastructure by keeping the orthogonality between, on the one hand, the mechanisms which are related to fault tolerance and, on the other hand, the rest of the functionality of the system. This leaves room for the use of standard components. Taking this design

References (23)

  • E. Weyuker, Difficulties measuring software risk in an industrial environment.
  • B. Littlewood et al., Design diversity: an update from research on reliability modeling.
  • P. Popov et al., Estimating bounds on the reliability of diverse systems, IEEE Transactions on Software Engineering (2003).
  • P. Popov et al., Diversity for off-the-shelf components.
  • J.H. Lala et al., Hardware and software fault tolerance: a unified architectural approach.
  • C.J. Walter, MAFT: an architecture for reliable fly-by-wire flight control.
  • J.H. Wensley et al., SIFT: design and analysis of a fault-tolerant computer for aircraft control, Proceedings of the IEEE (1978).
  • B.J. Gleeson, Fault tolerance: why should I pay for it?
  • J. Miro-Julia, A network architecture capable of efficiently running fault-tolerant applications, Ph.D. thesis, ...
  • A. Avižienis, The N-Version approach to fault-tolerant software, IEEE Transactions on Software Engineering (1985).
  • International Standard 11898, Road Vehicles – Interchange of Digital Information – Controller Area Network (CAN) for High-Speed Communication (1993).

Julián Proenza received the first degree in physics and the doctorate in informatics from the University of the Balearic Islands (UIB), Palma de Mallorca, Spain, in 1989 and 2007, respectively.

He is currently a Lecturer in the Department of Mathematics and Informatics at the UIB. His research interests include dependable and real-time systems, fault-tolerant distributed systems, clock synchronization and field-bus networks such as CAN (Controller Area Network).

José Miro-Julia received the degree of Licenciado con grado from the Universidad Complutense de Madrid in 1984. He received a doctorate in physics from the University of the Balearic Islands (UIB) in 1988 and a Ph.D. in computer science from UCLA in 1991.

He is currently a Professor in the Department of Mathematics and Informatics at the UIB. His research interests include CS education, bioinformatics, baking (bread and cakes), and social networks.

Prof. Miro-Julia is one of the founders and current president of AENUI (Asociación de Enseñantes Universitarios de la Informática, Society of CS University Educators).

Hans Hansson (A'01) received the M.Sc. degree in engineering physics, the Licentiate degree in computer systems, the B.A. degree in business administration, and the Doctor of Technology degree in computer systems from Uppsala University, Uppsala, Sweden, in 1981, 1984, 1984, and 1992, respectively.

He is currently a Professor of Computer Engineering at Mälardalen University, Västerås, Sweden, where he is Director of Research at the School of Innovation, Design, and Engineering. He is also Director of the Mälardalen Real-Time Research Centre and the Swedish national strategic Embedded Software research centre PROGRESS. He was previously Programme Director for the ARTES national Swedish real-time systems initiative, Department Chairman and Senior Lecturer at the Department of Computer Systems, Uppsala University, Chairman of the Swedish National Real-Time Association (SNART), and Researcher at the Swedish Institute of Computer Science, Stockholm, Sweden. His research interests include real-time system design, component-based software development, reliability and safety, timed and probabilistic modeling of distributed systems, scheduling theory, distributed real-time systems, and real-time communications networks.

Prof. Hansson is a member of the Association for Computing Machinery and SNART.
