Abstract:
Low-level hardware faults in a deep learning (DL) accelerator cause graceless degradation of high-level classification accuracy, which can lead to catastrophic consequences. This violates the functional safety (FuSa) of the DL accelerator, which must be maintained in high-assurance applications. Conventional error localization techniques incur high test effort and do not account for the unique challenges posed by DL systems. To this end, we propose DiagNNose, a two-tier machine learning-based error localization framework for on-line fault management in DL accelerators. We develop a novel diagnostic pattern selection algorithm to obtain a minimal subset of functional test patterns that are executed in the accelerator in mission mode. By extracting and analyzing dataflow-based features from the intermediate computations of the general matrix multiply (GEMM) core, a lightweight multilayer perceptron accomplishes bit-level error localization in 8-bit, 16-bit, and 32-bit datapath units with high fidelity. We limit our evaluation of the proposed DiagNNose framework to a single accelerator design, the versatile tensor accelerator (VTA) architecture. When executing state-of-the-art deep neural networks trained on ImageNet, error localization using only 30 diagnostic functional test patterns achieves up to 98.4% diagnosability, an improvement of 54.63% over a random test pattern set, with as low as 4.95% overhead in the DL accelerator in mission mode.
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Volume: 43, Issue: 1, January 2024)