A low-cost fault tolerant solution targeting commercial FPGA devices
Introduction
More than at any other time, global economics favor programmable chips over costly Application-Specific Integrated Circuits (ASICs) and Application-Specific Standard Products (ASSPs), since the costs and risks associated with application-specific devices can only be justified for a short list of ultra-high volume commodity products. Hence, programmable platforms, and more specifically Field-Programmable Gate Arrays (FPGAs), have become the only viable means for today’s companies to meet increasingly stringent product requirements – cost, power, performance, and density – in a business environment characterized by spiraling complexity, shrinking market windows, fickle market demands, capped engineering budgets, escalating ASIC and ASSP non-recurring engineering costs, and increased risk.
Recently, the reconfigurable-hardware industry has released a number of FPGA platforms with increased performance and density (e.g., Virtex-6, Virtex-7, EasyPath series, Stratix-IV, Stratix-V) [1], [2]. Even though these devices exhibit superior system-level functionality, they still consume more power than alternative hardware implementations (e.g., ASICs, ASSPs) [3].
Meeting the power budget, and consequently the thermal budget, is an essential criterion by which customers measure the success of their FPGA-based designs. The problem becomes far more severe because power density in FPGA devices almost doubles every three years [4], [5], [6], and this trend is expected to intensify with technology scaling according to the “alpha-power” law [7]. Among other effects, higher power and thermal stress make the target architecture more likely to suffer from reliability degradation, and hence decrease the mean time between failures (MTBF) [8], [9], [10], [11]. For instance, for commercial grade FPGAs, the maximum die temperature without performance degradation is reported as 80 °C, whereas the absolute maximum temperature is 125 °C [3]. Furthermore, based on previously published data, an average-sized design mapped onto a Virtex-E FPGA with 90% device utilization could lead to a die temperature 50 °C above the ambient temperature [3].
To cope with reliability degradation, a number of fault tolerant systems have been proposed. These solutions allow a design to continue its operation, possibly at a reduced level, rather than failing completely, when some part of the system fails.
Existing approaches to designing fault tolerant systems apply at either the hardware or the software level. Previous analysis has shown that hardware-based fault tolerance exhibits superior performance compared to alternative (software-based) implementations, but it imposes increased fabrication cost [12], [13], [14], [15], [16], [17], [18], [19], [20]. The derived platforms exhibit static fault masking, which is defined at fabrication time. On the other hand, software-based fault tolerant solutions can potentially combine the required dependability level with the low cost of commodity devices [21], [22], [23], [24], [25].
Apart from novel methodologies, a number of CAD tools that support these methodologies in software also have to be developed [21], [22], [23], [25]. These approaches are mainly academic solutions; up to now, the only known commercially available framework for providing fault masking on FPGAs is the Xilinx Triple Modular Redundancy (TMR) [26]. The principal idea of TMR is to use hardware redundancy to mask any single failure by voting on the results of three identical copies of the circuit.
Even though available solutions, and especially Xilinx TMR, provide the maximum fault coverage for both combinational and sequential logic, the consequent mitigation cost (in terms of power consumption, delay degradation and area overhead) means that this approach is acceptable only for mission critical systems [12]. However, with technology scaling, upsets due to reliability degradation are also critical for the majority of consumer products [5], [8], [9], [10]. Hence, novel techniques that provide sufficient fault coverage with the minimum mitigation cost are required.
As an improvement to Xilinx TMR, a number of methodologies and CAD tools have been proposed [27], [28], [29], [30], [31]. These solutions cluster hardware resources based on their sensitivity to errors, and then apply redundancy selectively, only to the suspicious portions of the design. Even though these approaches lead to superior performance compared to Xilinx TMR, existing solutions focus solely on masking Single Event Upsets (SEUs) [32]. In contrast, throughout this paper we introduce a software-supported framework that targets both intermediate and permanent faults. The proposed solution allows selecting the maximum possible fault coverage with respect to the application’s timing, power and area specifications.
The rest of the paper is organized as follows: Section 2 describes the motivation for this work, while Section 3 presents a number of alternative fault tolerant techniques. Section 4 introduces the proposed methodology. Experimental results that show the efficiency of our solution are provided in Section 5. Finally, conclusions are summarized in Section 6.
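The voting principle behind TMR can be illustrated with a minimal software sketch. This is a behavioural model only, not the Xilinx TMR tool or its interface; the names majority_vote, tmr and increment, as well as the fault-injection parameters, are hypothetical and introduced purely for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(circuit, x, faulty_replica=None, flip_mask=0):
    """Evaluate three identical copies of `circuit` and vote on the results.

    `faulty_replica` and `flip_mask` are hypothetical fault-injection knobs:
    the output bits of the chosen replica are flipped according to the mask.
    """
    outputs = [circuit(x) for _ in range(3)]
    if faulty_replica is not None:
        outputs[faulty_replica] ^= flip_mask  # corrupt one copy only
    return majority_vote(*outputs)


def increment(x: int) -> int:
    """Stand-in for the protected circuit: an 8-bit incrementer."""
    return (x + 1) & 0xFF


fault_free = tmr(increment, 41)
single_fault = tmr(increment, 41, faulty_replica=1, flip_mask=0x0F)
```

In this sketch, fault_free and single_fault are equal, because the two healthy replicas outvote the corrupted one. The model assumes the voter itself is fault free; an upset in the voter is not masked.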
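The idea of applying redundancy only to the most sensitive portions of a design, subject to an overhead constraint, can be sketched as a simple greedy selection. This is not the algorithm of this paper or of the cited tools; the Block record, the sensitivity scores, the area figures and the assumption that triplication costs two extra copies of a block (voter area ignored) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    sensitivity: float  # hypothetical error-sensitivity score (higher = more critical)
    area: int           # hypothetical block size, e.g. in LUTs


def select_for_triplication(blocks, area_budget):
    """Greedily pick blocks to triplicate without exceeding the extra-area budget.

    Blocks are ranked by sensitivity per unit area; triplicating a block is
    assumed to cost two additional copies of it (voter overhead is ignored).
    """
    chosen, used = [], 0
    ranked = sorted(blocks, key=lambda b: b.sensitivity / b.area, reverse=True)
    for blk in ranked:
        extra = 2 * blk.area
        if used + extra <= area_budget:
            chosen.append(blk.name)
            used += extra
    return chosen, used


design = [Block("alu", 0.9, 100), Block("fsm", 0.6, 20), Block("fifo", 0.2, 80)]
protected, overhead = select_for_triplication(design, area_budget=250)
```

With these illustrative numbers, the control logic and the ALU are protected for an overhead of 240 LUTs, while the low-sensitivity FIFO is left unprotected. A real flow would also have to check timing and power, which this sketch omits.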
Section snippets
Motivation example
Throughout this paper, we investigate the impact of Negative Bias Temperature Instability (NBTI) physical degradation on FPGAs. This phenomenon has recently gained a lot of attention due to its increasingly adverse impact on nanometer CMOS technology. NBTI is typically seen as a threshold voltage shift after a negative bias has been applied to a MOS gate at elevated temperature. This phenomenon mainly affects pMOS transistors, while degradation of channel carrier mobility is also observed [33].
Alternative instantiations of TMR
This section introduces three candidate TMR-based techniques that trade off the efficiency of fault masking against the consequent delay, power and area overheads. However, before proceeding to these solutions, we provide an overview of the underlying FPGA device [37], [38].
The employed reconfigurable architecture consists of an array of slices, each of which includes a Configurable Logic Block (CLB) and the surrounding routing infrastructure. The next level of hierarchy assumes that CLBs are
Proposed methodology for supporting low-cost fault masking
This section introduces the proposed methodology, as well as the corresponding algorithms, that enable application implementation with the maximum affordable (with respect to the system’s specifications) fault coverage against upsets that occur due to aging phenomena. This methodology, depicted in Fig. 5, is supported in software by a number of new and existing CAD tools. More specifically, the developed tools are publicly available through [41], whereas in order to support commercial devices, these
Experimental results
This section provides a number of experimental results that show the efficiency of the proposed methodology. For evaluation purposes, we employ a number of industrial oriented kernels from [34], [48]. Table 1 summarizes the complexity of the employed kernels in terms of 4-input LUTs. The target device is an Altera Stratix-based FPGA [37], [38]. Since we target general-purpose architectures, this device does not incorporate any dedicated fault tolerant mechanism. Regarding the number of injected upsets
Conclusion
A novel framework for supporting efficient fault masking against upsets that occur due to reliability degradation was introduced. Unlike similar approaches that protect the entire design, our solution provides a trade-off between the desired fault masking and the consequent mitigation cost due to replica hardware. Experimental results with a number of industrial oriented kernels have shown that the proposed framework outperforms similar solutions, since it achieves comparable fault coverage to
References (48)
- et al., Negative bias temperature instability in CMOS devices, Microelectronic Engineering (2005)
- et al., A complete platform and toolset for system implementation on fine-grain reconfigurable hardware, Microprocessors and Microsystems (2005)
- Xilinx FPGA devices....
- Altera FPGA devices....
- A. Lesea, M. Alexander, Powering Xilinx FPGAs, Tech. Rep. XAPP158, Xilinx,...
- ITRS, International technology roadmap for semiconductors, yield enhancement, Tech. Rep....
- Design challenges of technology scaling, Micro, IEEE (1999)
- Altera, Introducing innovations at 28 nm to move beyond Moore's law, White paper, 2010....
- et al., Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas, IEEE Journal of Solid-State Circuits (1990)
- et al., The impact of technology scaling on lifetime reliability
- Handbook of Semiconductor Manufacturing Technology
- Interconnect lifetime prediction for reliability-aware systems, IEEE Transactions on Very Large Scale Integration (VLSI) Systems
- Fault-Tolerance Techniques for SRAM-Based FPGAs (Frontiers in Electronic Testing)
- A survey of fault tolerant methodologies for FPGAs, ACM Transactions on Design Automation of Electronic Systems
- Analysis of yield loss due to random photolithographic defects in the interconnect structure of FPGAs
- Fault-Tolerant Systems
- Designing fault-tolerant techniques for SRAM-based FPGAs, IEEE Design Test of Computers
- Choose-your-own-adventure routing: lightweight load-time defect avoidance