
Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems


Abstract:

Hardware failures pose critical threats to deep neural network (DNN) training workloads, and the urgency of tackling this challenge (known as the Silent Data Corruption challenge in a broader context) has been raised widely by the industry. Based on industry reports, a large number of the failures observed in real systems are permanent hardware failures in logic. However, the effects that these failures can impose on DNN training workloads are still poorly understood. In this paper, we present the first resilience study on this subject, focusing on deep learning (DL) training accelerator systems. We developed a fault injection framework to accurately simulate the effects of permanent faults, and conducted 100K fault injection experiments. Our results provide a fundamental understanding of how logic permanent hardware failures affect training workloads and eventually generate unexpected training outcomes. Based on this new knowledge, we developed efficient software-based detection and recovery techniques to mitigate logic permanent hardware failures that are likely to generate unexpected outcomes. Evaluation on Google Cloud TPUs shows that our techniques are effective and practical: they require 15-25 lines of code change, and introduce 0.004%-0.025% performance/energy overhead for various representative neural network models.
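The paper's fault injection framework targets permanent logic faults inside the accelerator; its internals are not described in this abstract. As a purely illustrative sketch (the function name and bit choice below are hypothetical, not from the paper), a permanent stuck-at fault on a compute unit's output datapath can be emulated at the software level by forcing one bit of every produced fp32 value to a fixed level:

```python
import numpy as np

def inject_stuck_at(values: np.ndarray, bit: int, stuck_high: bool) -> np.ndarray:
    """Emulate a permanent stuck-at fault by forcing one bit of every
    float32 value to a fixed level (hypothetical helper, not the paper's API)."""
    raw = values.astype(np.float32).view(np.uint32)
    mask = np.uint32(1) << np.uint32(bit)
    faulty = (raw | mask) if stuck_high else (raw & ~mask)
    return faulty.view(np.float32)

# A stuck-at-1 fault on fp32 bit 30 (top exponent bit) turns small values
# into huge magnitudes or NaNs -- the kind of silent corruption that can
# derail training without crashing it.
acts = np.array([0.5, -1.25, 3.0], dtype=np.float32)
corrupted = inject_stuck_at(acts, bit=30, stuck_high=True)
```

Note that a value whose faulty bit already matches the stuck level passes through unchanged, which is why permanent faults can corrupt outputs only intermittently and evade simple checks.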
Date of Conference: 22-26 May 2023
Date Added to IEEE Xplore: 12 July 2023
Conference Location: Venezia, Italy

