Abstract:
Silent Data Errors (SDEs) are a unique category of errors that result in unpredictable system behavior that is often difficult to detect. SDEs can be a serious concern for at-scale compute in data center operations [1], [2]. This paper reviews data collected on the impact of SDEs on Artificial Intelligence (AI) workloads and Intel's SDE mitigation tools available for use in the data center.
Date of Conference: 14-18 April 2024
Date Added to IEEE Xplore: 16 May 2024