
Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems


Abstract:

Hardware failures pose critical threats to deep neural network (DNN) training workloads, and the urgency of tackling this challenge (known as the Silent Data Corruption challenge in a broader context) has been raised widely by the industry. Based on industry reports, a large number of the failures observed in real systems are permanent hardware failures in logic. However, the effects that these failures can impose on DNN training workloads are still poorly understood. In this paper, we present the first resilience study on this subject, focusing on deep learning (DL) training accelerator systems. We developed a fault injection framework to accurately simulate the effects of permanent faults, and conducted 100K fault injection experiments. Our results provide a fundamental understanding of how logic permanent hardware failures affect training workloads and eventually generate unexpected training outcomes. Based on this new knowledge, we developed efficient software-based detection and recovery techniques to mitigate logic permanent hardware failures that are likely to generate unexpected outcomes. Evaluation on Google Cloud TPUs shows that our techniques are effective and practical: they require 15-25 lines of code change, and introduce 0.004%-0.025% performance/energy overhead for various representative neural network models.
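The paper's fault injection framework targets permanent logic faults inside the accelerator; its internals are not described in this abstract. As a purely illustrative sketch (the function name and bit choice below are hypothetical, not from the paper), a permanent stuck-at fault on a compute unit's output datapath can be emulated at the software level by forcing one bit of every produced fp32 value to a fixed level:

```python
import numpy as np

def inject_stuck_at(values: np.ndarray, bit: int, stuck_high: bool) -> np.ndarray:
    """Emulate a permanent stuck-at fault by forcing one bit of every
    float32 value to a fixed level (hypothetical helper, not the paper's API)."""
    raw = values.astype(np.float32).view(np.uint32)
    mask = np.uint32(1) << np.uint32(bit)
    faulty = (raw | mask) if stuck_high else (raw & ~mask)
    return faulty.view(np.float32)

# A stuck-at-1 fault on fp32 bit 30 (top exponent bit) turns small values
# into huge magnitudes or NaNs -- the kind of silent corruption that can
# derail training without crashing it.
acts = np.array([0.5, -1.25, 3.0], dtype=np.float32)
corrupted = inject_stuck_at(acts, bit=30, stuck_high=True)
```

Note that a value whose faulty bit already matches the stuck level passes through unchanged, which is why permanent faults can corrupt outputs only intermittently and evade simple checks.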
Date of Conference: 22-26 May 2023
Date Added to IEEE Xplore: 12 July 2023
Conference Location: Venezia, Italy

