ABSTRACT
This paper proposes a continuous health-check approach for detecting Silent Data Corruption (SDC) in High Performance Computing (HPC) environments. The goal is to minimize the effect of hardware errors on the overall reliability and accuracy of the system by monitoring and validating the accuracy of data. Our work compares two approaches to overcoming SDC and presents the advantages and shortcomings of each. Our results show that, of the two proposed methods, threshold-triggered and continuous verification, the latter is superior in terms of latency.
Index Terms
- Bi-Source Verification Against Silent Data Corruption in High Performance Computing