Dependable design technique for system-on-chip

https://doi.org/10.1016/j.sysarc.2007.09.003

Abstract

A technique for highly reliable digital design based on two FPGAs under processor control is presented. The two FPGAs are used in a duplex system configuration, and better dependability parameters are obtained by combining them with totally self-checking blocks based on a parity predictor. Each FPGA can be reconfigured when an SEU fault is detected; the reconfiguration is controlled by a control unit implemented in a processor. Combinational circuit benchmarks have been considered in all our experiments and computations. All experimental results were obtained from a Xilinx FPGA implementation using EDA tools. A dependability model and dependability calculations are presented to document the improved reliability parameters.

Introduction

Systems realized by field programmable gate arrays (FPGAs) are increasingly popular and are used in a growing range of applications thanks to several properties and advantages:

  • High flexibility in achieving multiple requirements such as cost, performance and turnaround time.

  • Possible reconfiguration and later changes of the implemented circuit, e.g., remotely via a radio network connection.

FPGA circuits can be used in mission-critical applications such as aviation, medicine, space missions, and railway systems [1], [2], [3].

Many FPGAs are based on SRAM memories that are sensitive to Single Event Upsets (SEUs); therefore, simply using FPGA circuits in mission-critical applications without any method of error detection is impossible.

A single bit flip in the configuration memory changes the circuit function, often drastically. Concurrent error detection (CED) techniques allow faster detection of soft errors (errors that can be corrected by reconfiguration) caused by SEUs [4], [5], [6]. SEUs can also change the content of embedded memory, Look-up Tables (LUTs) and other configuration bits. These changes are not detectable by off-line tests; therefore, CED techniques have to be used. The probability of an SEU occurrence in SRAM is described in [7].

Several CED schemes for reliable computing system design have been proposed and used commercially. These techniques differ mainly in their error detection capabilities and in the constraints they impose on the system design. There are many publications on system design with concurrent error detection [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. They cover the design of datapath circuits (e.g., adders, multipliers) as well as general combinational and sequential logic circuits with concurrent error detection. Almost all publications on CED focus on its area/performance overhead. CED techniques (based on hardware duplication, parity codes [19], etc.) are widely used to enhance system dependability parameters, and all of them introduce some form of redundancy. All the above-mentioned CED techniques guarantee system data integrity against single faults.

However, these CED schemes are vulnerable to multiple faults and common-mode failures (CMFs). Common-mode failures are a special and very important case of multiple faults that generally arise from a single cause; system data integrity is not guaranteed in their presence. CMFs include operational failures caused by external effects (e.g., EMI, power-supply disturbances and radiation) or by internal causes. CMFs in redundant VLSI systems are surveyed in [9], [18].

A paper addressing self-checking FPGA design based on the adoption of error detection codes (e.g., the Berger code or parity code), as an evolution of traditional approaches developed for the ASIC platform, was presented in [20]. The method is based on design techniques that allow hardware fault detection in a combinational network through information redundancy at the functional/gate level. These techniques enable a more complete methodology for dynamically reconfigurable FPGAs that includes recovery from fault states: the system can be reconfigured when a fault is detected on-line.

A self-checking (SC) circuit (Fig. 1) is one possible realization of the CED scheme. The SC circuit is typically composed of the original circuit, a parity predictor and a checker. The parity predictor generates check bits from the primary inputs; together with the primary outputs, these check bits form a code word. The checker depends on the error detection (ED) code used for the parity predictor. Many papers have been published on this topic [21], [22].
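As an illustration of this scheme, the following minimal Python sketch models a tiny combinational block (a half adder) with a single even-parity predictor and checker; the function names and the example circuit are ours, not the paper's implementation:

```python
# Minimal sketch of a single-bit even-parity CED scheme (illustrative names).
# The "original circuit" is any combinational function; the parity predictor
# computes the expected parity of the outputs directly from the primary inputs.

def original_circuit(inputs):
    # Example functional block: a half adder returning (sum, carry).
    a, b = inputs
    return (a ^ b, a & b)

def parity_predictor(inputs):
    # Predicts the even parity of the outputs from the primary inputs only.
    a, b = inputs
    return (a ^ b) ^ (a & b)  # parity of (sum, carry)

def checker(outputs, check_bit):
    # The outputs plus the check bit must form an even-parity code word.
    parity = 0
    for bit in outputs:
        parity ^= bit
    return parity == check_bit  # True -> valid code word, False -> error

# Fault-free operation: every input combination yields a valid code word.
for a in (0, 1):
    for b in (0, 1):
        assert checker(original_circuit((a, b)), parity_predictor((a, b)))

# A single erroneous output bit violates the parity and is detected.
outs = original_circuit((1, 1))
faulty = (outs[0] ^ 1, outs[1])  # flip the sum bit
print(checker(faulty, parity_predictor((1, 1))))  # False -> error detected
```

A single-bit parity detects any odd number of erroneous output bits; detecting arbitrary errors requires stronger ED codes, as discussed above.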

In many publications the quality of an SC circuit is characterized by the number of detected faults. However, in many cases where the fault coverage is high (almost 100%), the area overhead is also high. High area overhead degrades the fault-tolerance properties, so it is important to find a trade-off between area overhead and fault coverage. These requirements have been taken into account in this paper.

CED techniques based on ED codes are widely used. However, many research groups have not evaluated the totally self-checking (TSC), fault secure (FS) and self-testing (ST) properties of the final circuit. Many publications describe only the TSC parameter, which provides insufficient information about all faults in a circuit implemented in an FPGA, since hidden faults are not taken into account. Therefore, this paper proposes a new fault classification to describe faults in FPGAs caused by SEUs.

A fault-tolerant system must satisfy fault-masking requirements: a fault occurring in such a system is detected and does not lead to incorrect function. If the system has no repair capability, it must be stopped after the next fault is detected. A fault-tolerant system protected against SEUs must also be reliable. However, CED techniques increase the final area, and techniques for increasing the reliability parameters based on a single FPGA are not sufficient. Some publications have focused on reliable systems based on a single FPGA using a triple modular redundancy (TMR) structure [4], [23], [24].

The TMR structure is unsuitable when its high area overhead cannot be afforded; in that case a hybrid architecture must be used. The TMR architecture and hybrid systems, e.g., the modified duplex system with a comparator and the most commonly used types of CED techniques, are compared in [25], [26]. A technique based on duplication with comparison (DWC) combined with CED is described in [27], [28].
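The difference between TMR masking and duplication with comparison can be shown with a minimal bit-level Python sketch (our illustration, not the paper's circuitry):

```python
# TMR masks a single faulty module; DWC only detects the mismatch.

def tmr_vote(a, b, c):
    # Bitwise majority voter: output is correct as long as at most
    # one of the three module outputs is wrong.
    return (a & b) | (a & c) | (b & c)

def dwc_compare(a, b):
    # Duplication with comparison: a mismatch flags a fault but cannot
    # tell which copy is correct, so the fault is detected, not masked.
    return a == b

correct, faulty = 1, 0          # one module output corrupted by an SEU
print(tmr_vote(correct, correct, faulty))   # 1 -> fault masked
print(dwc_compare(correct, faulty))         # False -> fault detected only
```

This is why TMR needs roughly triple area while DWC needs double area plus a recovery mechanism (here, reconfiguration) to restore correct operation.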

The fault-tolerant system proposed in this work is based on DWC–CED with reconfiguration. This paper presents methods for maximally enhancing the dependability parameters while keeping the area overhead minimal. The complex structure implemented in each FPGA is divided into small blocks, where every block satisfies the TSC properties. This approach can detect a fault earlier than the output comparator would.

Design methodology plays an important role in fault-tolerant systems based on a self-checking circuit. A methodology for self-checking code selection was presented in [29], [30]. It assumes that the circuits are described by multilevel logic and realized as ASICs. The synthesis process of such a self-checking circuit differs from the classical method: each part of the self-checking circuit is synthesized individually to prevent sharing of logic primitives among the blocks, because shared logic decreases the number of detected faults. Some papers describing methodologies for automatic VHDL modification have also been published [31], [32].

A design flow for protecting an FPGA-based system against SEUs is presented in [4]. That paper presents a design flow for developing a circuit resilient to SEUs, composed of standard professional tools as well as ad hoc developed tools. Experiments are performed on benchmark circuits and on a realistic circuit to show the capabilities of the proposed design flow.

There is another on-line testing approach, which the implemented design does not adopt: the on-line test is performed for the whole FPGA, without disturbing normal system operation [33], [34]. In this case, a structural test is performed.

Any way of maintaining proper system function is based on some form of redundancy, and redundancy always implies considerable area and/or time overhead. Our proposed structure improves the dependability parameters while ensuring a relatively low area overhead compared with classical methods such as duplication or triplication [35]. The term dependability encapsulates the concepts of reliability, availability, safety, maintainability, performability, and testability.

  • Reliability is a conditional probability, in that it depends on the system being operational at the beginning of the chosen time interval. The reliability of a system is a function of time, R(t).

  • Availability is a function of time A(t), defined as the probability that a system is operating correctly and is available to perform its functions at instant of time t.

  • Safety is the probability S(t) that a system will either perform its functions correctly or will discontinue its functions in a manner that does not disrupt the operation of other systems or compromise the safety of any people associated with the system.

  • Performability of a system is a function of time P(L, t), defined as the probability that the system performance will be at or above some level L at instant of time t. In many cases, it is possible to design systems that can continue to perform correctly after the occurrence of hardware and software faults, but at a somewhat diminished level of performance.

  • Maintainability is a measure of the ease with which a system can be repaired once it has failed. In more quantitative terms, maintainability is the probability M(t), that a failed system will be restored to an operational state within a period of time t. The restoration process includes locating the problem, physically repairing the system, and bringing the system back to its operational condition.

  • Testability is simply the ability to test for certain attributes within a system. Measures of testability allow us to assess the ease with which certain tests can be performed.

These parameters are described more precisely in [36].
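Under the classical constant-failure-rate (exponential) model, the first two parameters reduce to textbook formulas; the following Python sketch uses illustrative numbers, not values from this paper:

```python
import math

# Textbook dependability formulas under a constant failure rate.

def reliability(lam, t):
    # R(t) = exp(-lambda * t): probability of no failure in [0, t],
    # given the system was operational at t = 0.
    return math.exp(-lam * t)

def steady_state_availability(mttf, mttr):
    # A = MTTF / (MTTF + MTTR): long-run fraction of time a repairable
    # system is operational (mean time to failure vs. mean time to repair).
    return mttf / (mttf + mttr)

lam = 1e-5                                     # failures per hour (illustrative)
print(reliability(lam, 1000))                  # ~0.990
print(steady_state_availability(1e5, 2.0))     # ~0.99998
```

Note that for a non-repairable system availability collapses to reliability, which is why the repair (here, reconfiguration) capability is central to the availability comparison below.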

In this paper, the availability computation is used to compare our modified duplex system with a standard duplex system and a TMR system.

Our solution combines on-line testing design methods with a classical duplex design. It assumes dynamic reconfiguration of the faulty part of the system after an on-line fault detection. The most important criteria are the speed of fault detection and the safety of the whole circuit with respect to the application requirements.

Our previous research shows the relation between area overhead and SEU fault coverage [37]. Due to the need for a small area overhead, the SEU fault coverage for most circuits is less than 100%, typically varying from 75% to 95%. Therefore, we have to use an additional method to keep the TSC property. Combinational circuit benchmarks have been considered in all our experiments and computations. All experimental results were obtained from a Xilinx FPGA implementation using EDA tools. A dependability model and dependability calculations based on Markov chains are presented.

This paper builds on partial research results based on software and hardware simulation experiments presented in [38], [39], [40]. Our research aims at compound design methods for a system-on-a-chip (SoC) composed of several FPGA blocks, a processor, memories, etc. We assume that some or all FPGA blocks are dynamically reconfigurable. The resulting SoC design contains several independent FPGA blocks, each of which can be reconfigured independently.

The paper is organized as follows: first, basic terms concerning the classification of faults are presented in Section 2. The proposed structure to be implemented in FPGAs is described in Section 3. The dependability models and computations are presented in Section 4. Section 5 describes the design methodology for SoC. Fault injection and simulation are described in Section 6. Section 7 summarizes and presents the results obtained from these models in several graphs, and Section 8 concludes the paper.

Section snippets

Basic on-line testing property

There are three basic quantitative properties in the field of CED: the fault secure (FS), self-testing (ST) and totally self-checking (TSC) properties [36].

  • Fault secure: under each modeled fault, the erroneous outputs that are produced do not belong to the output error detection code. The reason for this property is obvious: if an erroneous output belongs to the error detection code, the error is not detected and the TSC goal is not achieved. Thus, the fault secure property is the most important

Proposed structure

Our previous results show that fulfilling the TSC property at 100% is difficult, so we have proposed a new structure based on two FPGAs, see Fig. 2.

Each FPGA has one primary input, one primary output and two pairs of checking signals OK/FAIL. The probability that the checking information is correct depends on the FS property: when the FS property is satisfied only to 75%, the correctness of the checking information is also 75%. It means that the signal “OK” gives correct information for 75% of

Dependability analysis

To evaluate the influence of a sequence of SEU faults, a more precise definition of “a single fault” is needed. Availability computations are used for the dependability analysis. In the following text we assume that “single data damaging” is defined as follows:

  • It occurs at a single time that is arbitrarily located on the time axis.

  • The fault can change a data item located within the FPGA configuration memory. Both FPGAs can be affected with the same probability. We assume the single

Design flow

The design methodology for TSC circuit creation is described in Fig. 5. To generate the output parity bits, all output values have to be defined for each input vector. Unfortunately, this does not hold in the benchmark definition files: only some output values are specified for each multi-dimensional input vector, and the rest are assigned as don't cares, left to be specified by another term. Thus, to be able to compute the parity bits, we have to split the intersecting terms so that
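One prerequisite of this parity computation, expanding don't-care input positions of a term into fully specified vectors, can be sketched in a few lines of Python (illustrative names; the paper's actual term-splitting procedure is only partially shown in this snippet):

```python
from itertools import product

# Expand a benchmark term with don't-care input positions ('-') into the
# fully specified input vectors it covers, so the output value (and hence
# the parity bit) can be evaluated for every concrete input vector.

def expand_term(term):
    # "1-0-" -> ["1000", "1001", "1100", "1101"]
    positions = [i for i, c in enumerate(term) if c == '-']
    results = []
    for bits in product('01', repeat=len(positions)):
        t = list(term)
        for pos, b in zip(positions, bits):
            t[pos] = b
        results.append(''.join(t))
    return results

print(expand_term('1-0-'))  # ['1000', '1001', '1100', '1101']
```

A term with k don't-care positions expands into 2^k minterms, which is why splitting only the intersecting terms, rather than expanding everything, keeps the computation tractable.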

Implementation and HW simulation

Every reaction to an input vector change must be calculated in the SW simulation, and every simulation step takes many processor cycles, especially for circuits with many gates. On the other hand, when one simulation step is processed in hardware, the time needed for the calculation equals one system cycle, although the results then have to be compared and evaluated concurrently. This led us to utilize the HW simulator.
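The fault-injection idea behind these simulations can be sketched by modelling an SEU as a single bit flip in the configuration of a simulated 4-input LUT; the example function and names are our illustration, not the paper's injector:

```python
# SEU fault injection into a simulated 4-input LUT configuration.
# The LUT content is a 16-bit integer; bit i is the output for input vector i.
# Fault coverage here is the fraction of injected flips that a DWC-style
# comparison against the golden (fault-free) copy detects.

def lut_eval(config, inputs):
    return (config >> inputs) & 1

def inject_seu(config, bit):
    # Model a single event upset: flip one configuration bit.
    return config ^ (1 << bit)

golden = 0x6996  # example LUT content: 4-input XOR function

detected = 0
for bit in range(16):
    faulty = inject_seu(golden, bit)
    # Exhaustively compare faulty and golden outputs over all input vectors.
    if any(lut_eval(faulty, v) != lut_eval(golden, v) for v in range(16)):
        detected += 1

print(detected, "of 16 injected faults detected")
```

In a real circuit only the input vectors actually applied during operation exercise the flipped bit, which is why measured SEU fault coverage stays below this exhaustive ideal.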

A fault injection into the implemented circuit allows us to calculate dependability

Dependability calculation results

The availability computations were used to compare our modified duplex system with a standard duplex system and with a TMR system. This section follows Section 4, “Dependability analysis”, which describes our modified duplex system with the Markov model and the dependability equations.
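As a simplified illustration of such a Markov availability computation, the following sketch solves the steady state of a three-state birth-death model of a repairable duplex system; the rates are illustrative and the paper's actual model has more states (per-FPGA faults, reconfiguration phases):

```python
# Steady-state availability of a repairable duplex system, modelled as a
# birth-death Markov chain. States: 2 = both FPGAs fault-free, 1 = one FPGA
# faulty (being reconfigured), 0 = both faulty (system down).
# lam = failure rate per FPGA, mu = repair (reconfiguration) rate.

def duplex_steady_state(lam, mu):
    # Transitions: 2 --2*lam--> 1 --lam--> 0; repairs 1 --mu--> 2, 0 --mu--> 1.
    # Birth-death balance gives p1 = (2*lam/mu)*p2 and p0 = (lam/mu)*p1.
    r1 = 2 * lam / mu
    r0 = (lam / mu) * r1
    p2 = 1.0 / (1.0 + r1 + r0)
    return p2, r1 * p2, r0 * p2

lam, mu = 1e-5, 0.5        # illustrative rates per hour
p2, p1, p0 = duplex_steady_state(lam, mu)
print(1.0 - p0)            # availability: every state except "both down"
```

Because the down state requires a second SEU to strike before the first reconfiguration completes, its probability scales with (lam/mu)^2, which is the quantitative reason reconfiguration speed dominates the availability comparison.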

First, the model parameters are discussed. The failure rate (λ) depends on the probability that impacting SEUs will change a bit in the FPGA configuration memory. The effect of the SEUs impacting on

Conclusion and future work

Our modified duplex system based on two FPGAs has been presented. It improves the dependability parameters relative to the standard duplex system, thanks to the reconfiguration process and two methods of SEU detection: the first compares the primary outputs of the two FPGAs, and the second signalizes a faulty FPGA. We described the system by a Markov dependability model, which was used for the computation of the availability parameters

Acknowledgement

This research has been partly supported by the MSM6840770014 research program.

Pavel Kubalík is Assistant Professor at the Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague. He defended his Ph.D. in September 2007. His research interests include digital design, on-line testing methods especially for FPGAs, dependability computations, fault injection methods, and hardware implementation of special applications for FPGAs and microprocessors.

References (46)

  • R. Dobiáš, H. Kubátová, FPGA based design of railway interlocking equipment, in: Proceedings of the EUROMICRO...
  • D. Ratter, FPGAs on Mars, Xcell Journal Online, www.xilinx.com,...
  • Actel Corporation, Historic Phoenix Mars Mission Flies Actel RTAX-S Devices, www.actel.com,...
  • L. Sterpone, M. Violante, A design flow for protecting FPGA-based systems against single event upsets, DFT2005, in:...
  • QuickLogic Corporation, Single Event Upsets in FPGAs, 2003,...
  • M. Bellato et al.

    Evaluating the effects of SEUs affecting the configuration memory of an SRAM-based FPGA

    Design Automation Event for Electronic System in Europe

    (2004)
  • E. Normand

    Single event upset at ground level

    IEEE Transactions on Nuclear Science

    (1996)
  • A. Krasniewski, Concurrent error detection in sequential circuits implemented using FPGAs with embedded memory blocks,...
  • S. Mitra, E.J. McCluskey, Which concurrent error detection scheme to choose, in: Proceedings of the International Test...
  • S. Mitra, E.J. McCluskey, Diversity techniques for concurrent error detection center for reliable computing, Dept. of...
  • K. Elshafey, J. Hlavicka, On-line detection and location of faulty CLBs in fpga-based systems, in: IEEE DDECS Workshop,...
  • A. Paschalis et al.

    Concurrent delay testing in totally self-checking system

    (1998)
  • P. Drineas, Y. Makris, Concurrent fault detection in random combinational logic, in: Proceedings of the IEEE...
  • K. Mohanram, E.S. Sogomonyan, M. Gössel, N.A. Touba, Synthesis of low-cost parity-based partially self-checking...
  • C. Bolchini et al.

    Fault analysis for networks with concurrent error detection

    IEEE Design and Test

    (1998)
  • C. Bolchini, F. Salice, D. Sciuto, R. Zavaglia, An integrated design approach for self-checking FPGAs, in: Proceedings...
  • J.S. Piestrak

Self-checking design in Eastern Europe

    IEEE Design and Test of Computers

    (1996)
  • S. Mitra et al.

Common-mode failures in redundant VLSI systems: a survey

    IEEE Transactions on Reliability

    (2000)
  • J. Adamek

    Foundations of Coding

    (1991)
  • C. Bolchini, F. Salice, D. Sciuto, Designing self-checking FPGAs through error detection codes, in: 17th IEEE...
  • S.J. Piestrak

    Design of self-testing checkers for m-out-of-n codes using parallel counters

    (1998)
  • D. Nikolos

    Self-testing embedded two-rail checkers

    (1998)
  • K. Nakahara, S. Kouyama, T. Izumi, H. Ochi, Y. Nakamura, Autonomous-repair cell for fault tolerant...


    Hana Kubátová is Associate Professor at the same department. Her research interests include Petri nets in modelling, simulation and hardware design, automata theory, digital system design, dependable design, and FPGA implementation methods.
