Abstract:
Technology scaling in modern computing platforms has sharply increased safety and reliability issues, which often accelerate aging, lead to permanent faults, and cause unreliable execution of applications. Failure in some computing systems, such as avionics, may have catastrophic consequences. Managing reliability under all conditions of stress and environmental change is therefore crucial at every abstraction layer, from the application level down to the transistor level. Machine learning techniques, which can adapt to varying workloads and system conditions, have recently been employed for dynamic reliability estimation and optimization. This paper presents reliability improvement approaches from multiple perspectives, from the transistor level to the application level, and discusses their effectiveness and limitations as well as open challenges.
Date of Conference: 17-19 April 2023
Date Added to IEEE Xplore: 02 June 2023
Print on Demand (PoD) ISBN: 979-8-3503-9624-9