Performance-aware reliability assessment of heterogeneous chips | IEEE Conference Publication | IEEE Xplore


Abstract:

Technology evolution has raised serious reliability concerns: as transistor dimensions shrink, modern microprocessors become denser and more vulnerable to faults. Reliability studies have proposed a plethora of methodologies for assessing system vulnerability; these, however, rely heavily on traditional reliability metrics that express only the failure rate over time. Although Failures In Time (FIT) is a strong and representative reliability metric, it may fail to offer an objective comparison of highly diverse systems, such as CPUs against GPUs or other accelerators, which are often employed to execute the same algorithms implemented for each platform. In this paper, we propose a reliability evaluation methodology that takes into account the probability of a workload execution failure in order to compare heterogeneous systems, while also capturing the differences in the performance of these systems. We demonstrate the usefulness of the methodology with a test case scenario that compares the reliability and performance of three different commercial CPUs (different ISAs and microarchitectures) and one GPU. We use statistical fault injection to assess the vulnerability of the register file for the four computing systems of our study. The evaluation was performed using a comprehensive set of benchmarks with the same algorithms implemented for each individual system (serial code for the CPUs and parallel code for the GPU). Our findings show that, even though the GPU proves to be three orders of magnitude more vulnerable than the CPUs under traditional reliability metrics, our performance-aware evaluation methodology shrinks this gap by 1-2 orders of magnitude, providing more informative and realistic measurements to guide designers' and programmers' decisions.
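The contrast between a rate-only metric and a per-execution failure probability can be illustrated with a short sketch. It is not the paper's formulation, which the abstract does not give. It assumes the common decomposition FIT = raw FIT per bit × number of bits × architectural vulnerability factor (AVF), a constant failure rate, and entirely hypothetical values for register file sizes, AVF, and execution times. The function names `fit_rate` and `failure_probability` are likewise illustrative.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # FIT counts expected failures per 10^9 device-hours


def fit_rate(raw_fit_per_bit, num_bits, avf):
    """FIT of a structure under the assumed model: raw rate x size x AVF."""
    return raw_fit_per_bit * num_bits * avf


def failure_probability(fit, exec_seconds):
    """Probability of at least one failure during a single workload run,
    assuming a constant failure rate (exponential model)."""
    rate_per_second = fit / (HOURS_PER_FIT_UNIT * 3600.0)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-rate_per_second * exec_seconds)


# Hypothetical inputs, chosen only to illustrate the effect (not paper data)
RAW_FIT_PER_BIT = 1e-4
cpu_fit = fit_rate(RAW_FIT_PER_BIT, num_bits=2_048, avf=0.10)
gpu_fit = fit_rate(RAW_FIT_PER_BIT, num_bits=2_048_000, avf=0.10)

cpu_p = failure_probability(cpu_fit, exec_seconds=100.0)  # serial run
gpu_p = failure_probability(gpu_fit, exec_seconds=2.0)    # parallel run

fit_gap = gpu_fit / cpu_fit   # gap seen by the rate-only metric
prob_gap = gpu_p / cpu_p      # gap seen per workload execution

print(f"FIT gap:         {fit_gap:.0f}x")
print(f"Probability gap: {prob_gap:.1f}x")
```

With these made-up inputs, the GPU looks about 1000 times worse by FIT but only about 20 times worse per workload execution, because its run exposes the register file to faults for a much shorter time. The real magnitudes depend on measured AVF values and execution times, which this sketch does not attempt to reproduce.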
Date of Conference: 09-12 April 2017
Date Added to IEEE Xplore: 18 May 2017
ISBN Information:
Electronic ISSN: 2375-1053
Conference Location: Las Vegas, NV, USA
