Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

Zhang, Guozhen; Liu, Yi; Yang, Hailong; Qian, Depei

doi:10.1007/s11227-021-03892-4

Published: 09 June 2021

Volume 78, pages 1381–1408, (2022)

The Journal of Supercomputing

Guozhen Zhang¹,
Yi Liu¹,
Hailong Yang¹ (ORCID: orcid.org/0000-0003-1101-7927) &
…
Depei Qian¹

Abstract

High-performance computing (HPC) is moving toward the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can have disastrous consequences for scientific computation and jeopardizes the reliability of HPC at large scale. The most commonly used methods for addressing SDC are based on modular redundancy, which usually requires synchronization to keep execution progress consistent between replicas, as well as additional message transmission and comparison during program execution. Although such methods can detect SDC with high recall, they can introduce significant performance overhead and even stall execution at large scale. To our knowledge, this paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an innovative asynchronous message comparison mechanism, which uses specialized service routines (Data-Analytic-Service, DAS) to perform progress comparison without interfering with the execution of the target program. In addition, our solution adopts a distributed parallel architecture to run DAS and uses an innovative reference mechanism based on a single non-deterministic event to guarantee consistent execution across replicas. We implemented a user-level prototype, termed synchronization-free SDC detection (SFSD). Experimental results on the Tianhe-2 supercomputer show that SFSD is effective in detecting SDC, with low performance overhead (within 10%) and an acceptable recall rate. Moreover, SFSD exhibits good scalability when applied to large-scale program executions.
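
To make the idea of combining message logging with asynchronous comparison more concrete, the following is a minimal illustrative sketch and not the SFSD implementation. It assumes each replica appends a digest of every outgoing message to its own log, and that a separate analysis routine later compares the logs without ever pausing the replicas. All names here (MessageLog, compare_logs, flip_bit) are hypothetical, and the use of SHA-256 digests is an assumption made only for the example.

```python
import hashlib
from collections import defaultdict


def digest(payload: bytes) -> str:
    """Fixed-size fingerprint of a message payload (SHA-256 is an assumption)."""
    return hashlib.sha256(payload).hexdigest()


class MessageLog:
    """Hypothetical append-only log kept by one replica.

    Digests are grouped per (source rank, destination rank) channel so that
    two replicas can be compared even if they have progressed differently.
    """

    def __init__(self):
        self.entries = defaultdict(list)

    def record(self, src: int, dst: int, payload: bytes) -> None:
        self.entries[(src, dst)].append(digest(payload))


def compare_logs(log_a: MessageLog, log_b: MessageLog):
    """Compare two replica logs after the fact.

    Only the common prefix of each channel is checked, so a replica that is
    simply behind in progress is not reported as corrupted.
    Returns a list of (channel, message_index) pairs that disagree.
    """
    mismatches = []
    for channel in sorted(set(log_a.entries) & set(log_b.entries)):
        pairs = zip(log_a.entries[channel], log_b.entries[channel])
        for index, (da, db) in enumerate(pairs):
            if da != db:
                mismatches.append((channel, index))
    return mismatches


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a single bit flip in a payload (for demonstration only)."""
    data = bytearray(payload)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


# Replica A sends three messages; replica B is one message behind and
# suffers a bit flip in its second message.
messages = [b"alpha", b"beta", b"gamma"]
replica_a, replica_b = MessageLog(), MessageLog()
for m in messages:
    replica_a.record(0, 1, m)
replica_b.record(0, 1, messages[0])
replica_b.record(0, 1, flip_bit(messages[1], 0))

print(compare_logs(replica_a, replica_b))
```

In this sketch the comparison reads only the logs, so neither replica has to wait for the other or exchange extra messages during execution; how the actual system schedules its DAS routines and handles non-deterministic events is not specified here.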

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2020YFB1506703) and the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002). Hailong Yang is the corresponding author.

Author information

Authors and Affiliations

Sino-German Joint Software Institute, Beihang University, Beijing, China
Guozhen Zhang, Yi Liu, Hailong Yang & Depei Qian

Corresponding author

Correspondence to Hailong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, G., Liu, Y., Yang, H. et al. Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. J Supercomput 78, 1381–1408 (2022). https://doi.org/10.1007/s11227-021-03892-4

Accepted: 13 May 2021
Published: 09 June 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s11227-021-03892-4
