Abstract:
The scaling of silicon devices has exacerbated the unreliability of modern computer systems, and power constraints have necessitated the involvement of software in hardware error detection. Simultaneously, the multi-core revolution has impelled software to become parallel. Therefore, there is a compelling need to protect parallel programs from hardware errors. The tasks of a parallel program have significant similarity in their control data due to the use of high-level programming models. In this study, we propose BLOCKWATCH to leverage the similarity in parallel programs' control data for detecting hardware errors. BLOCKWATCH statically extracts the similarity among different threads of a parallel program and checks the similarity at runtime. We evaluate BLOCKWATCH on seven SPLASH-2 benchmarks to measure its performance overhead and error detection coverage. We find that BLOCKWATCH incurs an average overhead of 16% across all programs, and provides an average silent data corruption (SDC) coverage of 97% for faults in the control data.
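
The idea of checking control-data similarity at runtime can be illustrated with a minimal sketch. This is not BLOCKWATCH's implementation: the `BranchMonitor` class, the `run_demo` function, the branch name, and the rule that every thread must report the same outcome for a given branch instance are all assumptions made for illustration, and the threads are simulated sequentially for determinism. The sketch only shows how a fault in a value that should be identical across threads can surface as a disagreement in branch outcomes.

```python
from collections import defaultdict


class BranchMonitor:
    """Hypothetical checker that collects branch outcomes reported by threads."""

    def __init__(self):
        # (branch_id, iteration) -> {thread_id: outcome}
        self.reports = defaultdict(dict)

    def report(self, branch_id, iteration, thread_id, taken):
        """Record the outcome one thread observed for one branch instance."""
        self.reports[(branch_id, iteration)][thread_id] = taken

    def check_same_outcome(self):
        """Return branch instances whose outcomes differ across threads."""
        return [
            key
            for key, outcomes in self.reports.items()
            if len(set(outcomes.values())) > 1
        ]


def run_demo(faulty_thread=None, num_threads=4):
    """Simulate threads evaluating a branch on a bound they all share.

    If faulty_thread is set, that thread's copy of the bound has one bit
    flipped, standing in for a hardware fault in control data.
    """
    monitor = BranchMonitor()
    shared_bound = 8
    for tid in range(num_threads):
        bound = shared_bound
        if tid == faulty_thread:
            bound ^= 0b100  # injected single-bit flip: 8 becomes 12
        for i in range(10):
            taken = i < bound  # branch that should agree across threads
            monitor.report("loop_bound_check", i, tid, taken)
    return monitor.check_same_outcome()


if __name__ == "__main__":
    print(run_demo())                 # fault-free run
    print(run_demo(faulty_thread=2))  # run with one corrupted thread
```

In the fault-free run every thread reports the same outcome for each branch instance, so no violation is returned. With the injected bit flip, the corrupted thread disagrees with the others only on the iterations where the changed bound alters the branch direction, which is where the check flags it.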
Date of Conference: 25-28 June 2012
Date Added to IEEE Xplore: 09 August 2012