Abstract
As fault susceptibility arising from mechanisms such as variation and degradation grows, engineers must give increasing consideration to error detection and correction. Common fault tolerance strategies frequently incur significant overheads in area, performance and/or power consumption, but options exist that buck this trend. In particular, algorithm-based fault tolerance is a proven family of low-overhead error mitigation techniques that can be built upon to create self-verifying circuitry.
In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This introduces a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of the logic used for error detection. We describe the implementation of a novel checksum truncation technique and analyse its effects upon overheads and allowed error. We find that bit-width reduction of the ABFT circuitry within a fault-tolerant accelerator for multiplying pairs of \(32 \times 32\) matrices reduced the incurred area overhead by 16.7% and recovered 8.27% of timing model \(f_\text{max}\). These gains came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
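The tradeoff described above can be illustrated with a small software model. The sketch below is not the circuit from the paper: it assumes a classic column-checksum test for integer matrix multiplication and models truncation simply by dropping the least significant bits of each checksum before comparison. The names `abft_check`, `dropped_bits` and `inject` are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_sums(m):
    """Sum each column of m, giving one checksum per column."""
    return [sum(col) for col in zip(*m)]


def abft_check(a, b, c, dropped_bits=0):
    """Return True if c passes a column-checksum test against a * b.

    Assumption: truncation is modelled by discarding the `dropped_bits`
    least significant bits of both checksums before they are compared.
    """
    # The column sums of (a * b) equal the column-sum row of a times b.
    expected = matmul([column_sums(a)], b)[0]
    actual = column_sums(c)
    return all((e >> dropped_bits) == (s >> dropped_bits)
               for e, s in zip(expected, actual))


def inject(c, row, col, delta):
    """Return a copy of c with delta added to one element (a modelled fault)."""
    faulty = [list(r) for r in c]
    faulty[row][col] += delta
    return faulty


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)

small_fault = inject(c, 0, 0, 1)   # low-magnitude error
large_fault = inject(c, 0, 0, 8)   # error of magnitude 2**3

full_precision_detects_small = not abft_check(a, b, small_fault)
truncated_detects_small = not abft_check(a, b, small_fault, dropped_bits=3)
truncated_detects_large = not abft_check(a, b, large_fault, dropped_bits=3)
```

In this model, an error whose magnitude is at least `2**dropped_bits` always changes the truncated checksum, while a smaller error may pass unnoticed. That is the loss of observability for low-magnitude errors that the abstract trades for narrower detection logic. The model says nothing about area or timing, which depend on the hardware implementation.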
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Davis, J.J., Cheung, P.Y.K. (2016). Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators. In: Bonato, V., Bouganis, C., Gorgon, M. (eds) Applied Reconfigurable Computing. ARC 2016. Lecture Notes in Computer Science, vol 9625. Springer, Cham. https://doi.org/10.1007/978-3-319-30481-6_31
DOI: https://doi.org/10.1007/978-3-319-30481-6_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30480-9
Online ISBN: 978-3-319-30481-6
eBook Packages: Computer Science (R0)