A low-cost fault tolerant solution targeting commercial FPGA devices

https://doi.org/10.1016/j.sysarc.2013.09.011

Abstract

Technology scaling, in conjunction with the trend towards higher operating frequencies, results in increased thermal stress, which in turn leads to upsets due to reliability degradation. In this paper, we introduce a software-supported framework that aims to provide sufficient fault coverage against upsets caused by aging phenomena. Experimental results with a number of industry-oriented DSP kernels show the effectiveness of our framework: compared to a well-established commercial solution, we achieve average improvements in maximum operating frequency and power consumption of 15% and 70%, respectively, for comparable fault masking.

Introduction

More than at any other time, global economics favor programmable chips over costly Application-Specific Integrated Circuits (ASICs) and Application-Specific Standard Products (ASSPs), since the costs and risks associated with application-specific devices can only be justified for a short list of ultra-high volume commodity products. Hence, programmable platforms, and more specifically Field-Programmable Gate Arrays (FPGAs), have become the only viable means for today’s companies to meet increasingly stringent product requirements – cost, power, performance, and density – in a business environment characterized by spiraling complexity, shrinking market windows, fickle market demands, capped engineering budgets, escalating ASIC and ASSP non-recurring engineering costs, and increased risk.

Recently, the reconfigurable industry has released a number of FPGA platforms with increased performance and density (e.g., Virtex-6, Virtex-7, EasyPath series, Stratix-IV, Stratix-V) [1], [2]. Even though these devices exhibit superior system-level functionality, they still consume more power than other hardware implementations (e.g., ASICs, ASSPs) [3].

Meeting the power budget, and consequently the thermal budget, is an essential criterion by which customers measure the success of their FPGA-based designs. The problem is becoming far more severe because power density in FPGA devices almost doubles every three years [4], [5], [6], and this trend is expected to intensify with technology scaling according to the alpha-power law [7]. Among other effects, higher power and thermal stress make the target architecture more likely to suffer from reliability degradation, and hence reduce the mean time between failures (MTBF) [8], [9], [10], [11]. For instance, for commercial-grade FPGAs, the maximum die temperature without performance degradation is reported as 80 °C, whereas the absolute maximum temperature is 125 °C [3]. Furthermore, based on previously published data, an average-sized design mapped onto a Virtex-E FPGA with 90% device utilization could lead to a die temperature 50 °C above ambient [3].

To address reliability degradation, a number of fault tolerant systems have been proposed. These solutions allow a design to continue operating, possibly at a reduced level, rather than failing completely when some part of the system fails.

Existing approaches to designing fault tolerant systems apply at either the hardware or the software level. Previous analysis has shown that hardware-based fault tolerance exhibits superior performance compared to alternative (software-based) implementations, but imposes increased fabrication cost [12], [13], [14], [15], [16], [17], [18], [19], [20]. The derived platforms exhibit static fault masking, which is defined at fabrication time. On the other hand, software-based fault tolerant solutions can potentially combine the required dependability level with the low cost of commodity devices [21], [22], [23], [24], [25].

Apart from novel methodologies, CAD tools that provide software support for these methodologies also have to be developed [21], [22], [23], [25]. These approaches are mainly academic; up to now, the only known commercially available framework for providing fault masking on FPGAs is Xilinx Triple Modular Redundancy (TMR) [26]. The principal idea of TMR is to use hardware redundancy to mask any single failure by voting on the results of three identical copies of the circuit.
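The voting principle behind TMR can be illustrated with a minimal behavioral sketch in Python. It is not the Xilinx TMR implementation: the names `majority_vote`, `tmr` and `adder`, as well as the XOR-based fault masks used to emulate upsets, are assumptions introduced purely for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs: each output bit
    takes the value that at least two of the three replicas agree on."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x: int, fault_masks=(0, 0, 0)) -> int:
    """Evaluate three copies of `module` on the same input and vote.

    `fault_masks` is a hypothetical way to emulate upsets: the output of
    replica i is XOR-ed with fault_masks[i], flipping the selected bits.
    """
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def adder(x: int) -> int:
    """Toy 8-bit circuit used as the protected module (illustrative only)."""
    return (x + 3) & 0xFF


if __name__ == "__main__":
    print(tmr(adder, 5))                      # fault-free: 8
    print(tmr(adder, 5, (0x10, 0, 0)))        # one faulty replica is masked: 8
    print(tmr(adder, 5, (0x10, 0x10, 0)))     # two faulty replicas win the vote: 24
```

The last call shows the limit of the scheme: the voter masks any single failing replica, but two replicas that fail on the same bit outvote the correct one.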

Even though available solutions, and especially Xilinx TMR, provide the maximum fault coverage for both combinational and sequential logic, the consequent mitigation cost (in terms of power consumption, delay degradation and area overhead) means that this approach is acceptable only for mission-critical systems [12]. However, with technology scaling, upsets due to reliability degradation are also becoming critical for the majority of consumer products [5], [8], [9], [10]. Hence, novel techniques that provide sufficient fault coverage at the minimum mitigation cost are required.

As an improvement over Xilinx TMR, a number of methodologies and CAD tools have been proposed [27], [28], [29], [30], [31]. These solutions cluster hardware resources based on their sensitivity to errors, and then apply redundancy selectively, only to the most error-sensitive portions of the design. Even though these approaches lead to superior performance compared to Xilinx TMR, they focus solely on masking Single Event Upsets (SEUs) [32]. In contrast, in this paper we introduce a software-supported framework that targets both intermediate and permanent faults. The proposed solution allows selecting the maximum possible fault coverage with respect to the application’s timing, power and area specifications.
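The idea of applying redundancy selectively, within the limits set by the design's specifications, can be sketched as a simple greedy selection under an area budget. This is only an illustration and not the algorithm of the proposed framework or of the cited tools: the `Block` type, the sensitivity scores, the cost model (two extra replicas plus one voter per protected block) and the function `select_for_tmr` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    sensitivity: float  # hypothetical score: how likely an upset here corrupts an output
    area: int           # hypothetical cost of one copy, e.g. in LUTs


def select_for_tmr(blocks, extra_area_budget: int, voter_area: int = 1):
    """Greedily choose blocks to triplicate without exceeding the area budget.

    Blocks are ranked by sensitivity per unit of extra area, so the budget
    is spent first where redundancy masks the most likely failures.
    Returns the chosen block names and the extra area actually used.
    """
    def overhead(block: Block) -> int:
        # Assumed cost model: two additional replicas plus one voter.
        return 2 * block.area + voter_area

    chosen, used = [], 0
    ranked = sorted(blocks, key=lambda b: b.sensitivity / overhead(b), reverse=True)
    for block in ranked:
        if used + overhead(block) <= extra_area_budget:
            chosen.append(block.name)
            used += overhead(block)
    return chosen, used


if __name__ == "__main__":
    design = [Block("mac", 0.9, 10), Block("ctrl", 0.5, 2), Block("fifo", 0.1, 20)]
    print(select_for_tmr(design, extra_area_budget=30))  # (['ctrl', 'mac'], 26)
```

A greedy ranking is used here only because it is short; a real flow would also have to account for timing and power constraints, which this sketch ignores.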

The rest of the paper is organized as follows: Section 2 describes the motivation for this work, while Section 3 gives a number of alternative fault tolerant techniques. Section 4 introduces the proposed methodology. Experimental results that show the efficiency of our solution are provided in Section 5. Finally, conclusions are summarized in Section 6.

Section snippets

Motivation example

Throughout this paper, we investigate the impact of Negative Bias Temperature Instability (NBTI) physical degradation on FPGAs. This phenomenon has recently gained a lot of attention due to its increasingly adverse impact on nanometer CMOS technology. NBTI is typically observed as a threshold voltage shift after a negative bias has been applied to a MOS gate at elevated temperature. This phenomenon mainly affects pMOS transistors, while degradation of channel carrier mobility is also observed [33].
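To give a feel for how a threshold voltage shift translates into timing degradation, the following sketch combines two textbook-style models: an empirical power law for the NBTI-induced shift, ΔVth = a·tⁿ, and the alpha-power-law delay model, in which gate delay is proportional to Vdd / (Vdd − Vth)^α. Neither the power-law form nor any of the parameter values (`a`, `n`, `alpha`, `vdd`, `vth0`) are taken from the paper; they are illustrative assumptions only.

```python
def nbti_vth_shift(t_seconds: float, a: float = 0.001, n: float = 0.2) -> float:
    """Threshold voltage shift in volts after t_seconds of stress.

    Assumed empirical power law dVth = a * t**n; `a` and `n` are
    illustrative placeholders, not fitted to any device.
    """
    return a * t_seconds ** n


def gate_delay(vdd: float, vth: float, alpha: float = 1.3, k: float = 1.0) -> float:
    """Alpha-power-law delay model: delay = k * Vdd / (Vdd - Vth)**alpha."""
    return k * vdd / (vdd - vth) ** alpha


def delay_degradation(t_seconds: float, vdd: float = 1.2, vth0: float = 0.3,
                      a: float = 0.001, n: float = 0.2) -> float:
    """Relative delay increase of an aged gate versus a fresh one."""
    fresh = gate_delay(vdd, vth0)
    aged = gate_delay(vdd, vth0 + nbti_vth_shift(t_seconds, a, n))
    return aged / fresh - 1.0


if __name__ == "__main__":
    seconds_per_year = 365 * 24 * 3600
    for years in (1, 5, 10):
        slowdown = delay_degradation(years * seconds_per_year)
        print(f"{years:2d} years of stress: {100 * slowdown:.1f}% slower")
```

Under this model the delay penalty grows monotonically with stress time, which is why a design that meets timing when fresh can start producing upsets as it ages.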

Alternative instantiations of TMR

This section introduces three candidate TMR-based techniques that trade off the efficiency of fault masking against the consequent delay, power and area overheads. However, before proceeding to these solutions, we provide an overview of the underlying FPGA device [37], [38].

The employed reconfigurable architecture consists of an array of slices, each of which includes a Configurable Logic Block (CLB) and the surrounding routing infrastructure. The next level of hierarchy assumes that CLBs are

Proposed methodology for supporting low-cost fault masking

This section introduces the proposed methodology, as well as the corresponding algorithms, that enable application implementation with the maximum affordable (with respect to the system’s specifications) fault coverage against upsets caused by aging phenomena. This methodology, depicted in Fig. 5, is supported in software by a number of new and existing CAD tools. More specifically, the developed tools are publicly available through [41], whereas in order to support commercial devices, these

Experimental results

This section provides a number of experimental results that show the efficiency of the proposed methodology. For evaluation purposes, we employ a number of industry-oriented kernels from [34], [48]. Table 1 summarizes the complexity of the employed kernels in terms of 4-input LUTs. The target device is an Altera Stratix-based FPGA [37], [38]. Since we target general-purpose architectures, this device does not incorporate any dedicated fault tolerant mechanism. Regarding the number of injected upsets

Conclusion

A novel framework for supporting efficient fault masking against upsets caused by reliability degradation was introduced. Unlike similar approaches that protect the entire design, our solution provides a trade-off between the desired fault masking and the consequent mitigation cost due to replica hardware. Experimental results with a number of industry-oriented kernels show that the proposed framework outperforms similar solutions, since it achieves comparable fault coverage to

References (48)

  • S. Mahapatra et al., Negative bias temperature instability in CMOS devices, Microelectronic Engineering (2005)
  • V. Kalenteridis et al., A complete platform and toolset for system implementation on fine-grain reconfigurable hardware, Microprocessors and Microsystems (2005)
  • Xilinx FPGA devices....
  • Altera FPGA devices....
  • A. Lesea, M. Alexander, Powering Xilinx FPGAs, Tech. Rep. XAPP158, Xilinx,...
  • ITRS, International technology roadmap for semiconductors, yield enhancement, Tech. Rep....
  • S. Borkar, Design challenges of technology scaling, IEEE Micro (1999)
  • Altera, Introducing innovations at 28 nm to move beyond Moore's law, White paper, 2010....
  • T. Sakurai et al., Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas, IEEE Journal of Solid-State Circuits (1990)
  • J. Srinivasan et al., The impact of technology scaling on lifetime reliability
  • R. Doering et al., Handbook of Semiconductor Manufacturing Technology (2008)
  • Z. Lu et al., Interconnect lifetime prediction for reliability-aware systems, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2007)
  • S. Srinivasan, P. Mangalagiri, Y. Xie, N. Vijaykrishnan, K. Sarpatwari, FLAW: FPGA lifetime awareness, in: Design...
  • F. Kastensmidt et al., Fault-Tolerance Techniques for SRAM-Based FPGAs (Frontiers in Electronic Testing) (2006)
  • J.A. Cheatham et al., A survey of fault tolerant methodologies for FPGAs, ACM Transactions on Design Automation of Electronic Systems (2006)
  • A. Yu, G. Lemieux, Defect-tolerant FPGA switch block and connection block with fine-grain redundancy for yield...
  • N. Campregher et al., Analysis of yield loss due to random photolithographic defects in the interconnect structure of FPGAs
  • R. Jain, A. Mukherjee, K. Paul, Defect-aware design paradigm for reconfigurable architectures, in: 2006 IEEE Computer...
  • I. Koren et al., Fault-Tolerant Systems (2007)
  • F. de Lima Kastensmidt et al., Designing fault-tolerant techniques for SRAM-based FPGAs, IEEE Design & Test of Computers (2004)
  • A. Jacobs, A. George, G. Cieslewski, Reconfigurable fault tolerance: A framework for environmentally adaptive fault...
  • M. Zhou, L. Shang, Y. Hu, Reliability optimization of reconfigurable computing-based fault-tolerant system, in: 11th...
  • R. Rubin et al., Choose-your-own-adventure routing: lightweight load-time defect avoidance
  • A. Doumar, S. Kaneko, H. Ito, Defect and fault tolerance FPGAs by shifting the configuration data, in: International...