Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

The Journal of Supercomputing

Abstract

High-performance computing (HPC) is moving into the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can have disastrous consequences for scientific computation and jeopardizes the reliability of HPC at large scale. The most commonly used methods for addressing SDC are based on modular redundancy, which usually requires synchronization to keep execution progress consistent between replicas, as well as additional message transmission and comparison during program execution. Although such methods can detect SDC with high recall, they can introduce significant performance overhead and even stall execution at large scale. To our knowledge, this paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with a novel asynchronous message comparison mechanism, which uses specialized service routines (Data-Analytic-Service, DAS) to perform progress comparison without interfering with the execution of the target program. In addition, our solution adopts a distributed parallel architecture to run DAS and uses a novel reference mechanism based on a single non-deterministic event to guarantee consistent execution across replicas. We implemented a user-level prototype, termed synchronization-free SDC detection (SFSD). Experimental results on the Tianhe-2 supercomputer show that SFSD is effective in detecting SDC, with low performance overhead (within 10%) and an acceptable recall rate. Moreover, SFSD exhibits good scalability when applied to large-scale program executions.
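The general idea of comparing logged messages asynchronously can be illustrated with a small sketch. The abstract does not describe the internals of SFSD or DAS, so everything below is an assumption made for illustration: the names `MessageLog`, `compare_logs`, and `flip_bit` are hypothetical, SHA-256 digests stand in for whatever the prototype actually stores, and in-memory dictionaries stand in for the real logs. Each replica records a digest of every message it sends, and a separate routine later compares only the entries that both logs already contain, so a replica that runs ahead never waits for the other.

```python
import hashlib
from collections import defaultdict


class MessageLog:
    """Hypothetical per-replica log of message digests.

    Entries are keyed by (source, dest, tag, sequence number), so the same
    logical message gets the same key in every replica.
    """

    def __init__(self):
        self._next_seq = defaultdict(int)  # next sequence number per channel
        self.entries = {}                  # key -> hex digest of the payload

    def record(self, source, dest, tag, payload: bytes):
        channel = (source, dest, tag)
        seq = self._next_seq[channel]
        self._next_seq[channel] += 1
        # Assumption: store a digest instead of the full payload.
        self.entries[channel + (seq,)] = hashlib.sha256(payload).hexdigest()


def compare_logs(log_a, log_b):
    """Return the sorted keys whose digests differ between the two logs.

    Only keys present in both logs are compared; entries that one replica
    has not produced yet are simply skipped until a later pass.
    """
    common = log_a.entries.keys() & log_b.entries.keys()
    return sorted(k for k in common if log_a.entries[k] != log_b.entries[k])


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate an SDC by flipping one bit of the payload."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Replica A has sent three messages; replica B lags behind by one message
# and its second message carries a single flipped bit.
replica_a, replica_b = MessageLog(), MessageLog()
messages = [b"step-0", b"step-1", b"step-2"]
for m in messages:
    replica_a.record(0, 1, 7, m)
replica_b.record(0, 1, 7, messages[0])
replica_b.record(0, 1, 7, flip_bit(messages[1], 3))

print(compare_logs(replica_a, replica_b))  # [(0, 1, 7, 1)]
```

In this sketch the comparison is a pure function over two logs, which is what lets it run outside the critical path of the replicas; how SFSD distributes this work and how it handles non-deterministic events is described only at a high level in the abstract and is not modeled here.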



Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2020YFB1506703) and the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002). Hailong Yang is the corresponding author.

Author information

Corresponding author

Correspondence to Hailong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, G., Liu, Y., Yang, H. et al. Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. J Supercomput 78, 1381–1408 (2022). https://doi.org/10.1007/s11227-021-03892-4

  • DOI: https://doi.org/10.1007/s11227-021-03892-4
