Abstract
High-performance computing (HPC) is entering the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can have disastrous consequences for scientific computation and jeopardizes the reliability of HPC at large scale. The most commonly used methods for addressing SDC are based on modular redundancy, which usually requires synchronization to keep execution progress consistent between replicas, as well as additional message transmission and comparison during program execution. Although such methods detect SDC with high recall, they can introduce significant performance overhead and even stall execution at large scale. To our knowledge, this paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an innovative asynchronous message comparison mechanism, which uses specialized service routines (Data-Analytic-Service, DAS) to compare execution progress without interfering with the execution of the target program. In addition, our solution runs DAS on a distributed parallel architecture and uses an innovative reference mechanism based on a single non-deterministic event to guarantee consistent execution across replicas. We implemented a user-level prototype, termed synchronization-free SDC detection (SFSD). Experimental results on the Tianhe-2 supercomputer show that SFSD detects SDC effectively, with low performance overhead (within 10%) and an acceptable recall rate. Moreover, SFSD scales well to large-scale program executions.
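The idea of combining message logging with asynchronous comparison can be illustrated with a minimal sketch. It is not the SFSD implementation: it assumes two replicas modeled as threads, and the names `MessageLog`, `compare_logs`, `run_replica`, and `demo` are hypothetical. It also omits the distributed DAS architecture and the reference mechanism for non-deterministic events. Each replica only appends message digests to its own log; a separate checker thread, standing in for a DAS-like routine, compares the two logs without the replicas ever waiting on each other.

```python
import hashlib
import queue
import threading


def digest(payload: bytes) -> str:
    """Fixed-size fingerprint of a message payload."""
    return hashlib.sha256(payload).hexdigest()


class MessageLog:
    """Per-replica log (hypothetical). Appending never waits on the other replica."""

    def __init__(self):
        self._q = queue.Queue()

    def record(self, seq: int, payload: bytes) -> None:
        self._q.put((seq, digest(payload)))

    def close(self) -> None:
        self._q.put(None)  # sentinel: no more messages

    def entries(self):
        while True:
            item = self._q.get()  # only the checker blocks here
            if item is None:
                return
            yield item


def compare_logs(log_a: MessageLog, log_b: MessageLog) -> list:
    """Checker routine: return sequence numbers whose entries differ."""
    mismatches = []
    for (seq_a, dig_a), (seq_b, dig_b) in zip(log_a.entries(), log_b.entries()):
        if seq_a != seq_b or dig_a != dig_b:
            mismatches.append(seq_a)
    return mismatches


def run_replica(log: MessageLog, messages, flip_at=None) -> None:
    """Simulated replica: logs each outgoing message, optionally with one bit flipped."""
    for seq, payload in enumerate(messages):
        if seq == flip_at:
            # Inject a single bit flip into the first byte to emulate SDC.
            payload = bytes([payload[0] ^ 0x01]) + payload[1:]
        log.record(seq, payload)
    log.close()


def demo(flip_at=None) -> list:
    messages = [f"msg-{i}".encode() for i in range(5)]
    log_a, log_b = MessageLog(), MessageLog()
    result = []
    checker = threading.Thread(
        target=lambda: result.extend(compare_logs(log_a, log_b))
    )
    replica_a = threading.Thread(target=run_replica, args=(log_a, messages))
    replica_b = threading.Thread(target=run_replica, args=(log_b, messages, flip_at))
    for t in (checker, replica_a, replica_b):
        t.start()
    for t in (replica_a, replica_b, checker):
        t.join()
    return result
```

Calling `demo(flip_at=3)` reports the corrupted message at sequence number 3, while `demo()` reports no mismatches. Comparing digests rather than full payloads keeps the log entries small, which is one plausible way to limit the cost of logging in such a scheme.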

Acknowledgements
This work was supported by the National Key R&D Program of China (Grant No. 2020YFB1506703) and the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002). Hailong Yang is the corresponding author.
Cite this article
Zhang, G., Liu, Y., Yang, H. et al. Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. J Supercomput 78, 1381–1408 (2022). https://doi.org/10.1007/s11227-021-03892-4
DOI: https://doi.org/10.1007/s11227-021-03892-4