Abstract
This paper studies the use of Redundant Multi-Threading (RMT) to detect Silent Data Corruptions in HPC applications. To understand whether it can be a viable solution in an HPC context, we study two software optimizations that lower the performance overhead of RMT by reducing the amount of data exchanged between the replicated threads. We conduct experiments with representative HPC workloads to measure the performance gains obtained through these optimizations and the error detection coverage they achieve. In the best case, when running on a processor that features Simultaneous Multi-Threading, our results show that the overhead can be as low as \(1.4\times\) without significantly reducing the ability to detect data corruptions.
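As a rough illustration of the general idea behind RMT, and not the implementation evaluated in the paper, the sketch below runs the same computation in a leading and a trailing thread, forwards each result of the leading thread through a queue, and lets the trailing thread compare it against its own result. All names (`kernel`, `leading`, `trailing`, `run_replicated`, `flip_at`) are hypothetical, and the bit flip is an artificial fault added only to show a detection. Every value is exchanged here, which is the kind of communication cost the optimizations studied in the paper aim to reduce.

```python
import queue
import threading


def kernel(x):
    # Stand-in computation (assumption): any deterministic function would do.
    return 3 * x * x + 1


def leading(inputs, channel, flip_at=None):
    # Leading thread: computes each value and forwards it for checking.
    for i, x in enumerate(inputs):
        y = kernel(x)
        if i == flip_at:
            y ^= 1 << 4  # simulate a bit flip in the leading thread's result
        channel.put((i, y))
    channel.put(None)  # end-of-stream marker


def trailing(inputs, channel, mismatches):
    # Trailing thread: recomputes each value and compares it with the leader's.
    while True:
        item = channel.get()
        if item is None:
            break
        i, y = item
        if kernel(inputs[i]) != y:
            mismatches.append(i)


def run_replicated(inputs, flip_at=None):
    # Runs both replicas and returns the indices where they disagreed.
    channel = queue.Queue()
    mismatches = []
    t_lead = threading.Thread(target=leading, args=(inputs, channel, flip_at))
    t_trail = threading.Thread(target=trailing, args=(inputs, channel, mismatches))
    t_lead.start()
    t_trail.start()
    t_lead.join()
    t_trail.join()
    return mismatches
```

Calling `run_replicated(list(range(8)))` reports no mismatch, while passing `flip_at=3` makes the trailing thread flag index 3.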
Notes
1. Results not included in the paper due to limited space.
2. The results presented in Sect. 4 are obtained with \(K=16\). Our tests showed that this is the value that leads to the best performance for our applications.
3. Trying to take advantage of the SMT threads by running 40 ranks in the non-replicated run does not provide any performance improvement.
4. We also tested configurations with 10 ranks when RMT was used, so as to have one thread per core. In most cases, the performance was equivalent to that of the default configuration.
5. The applications tested in [6] belong to different benchmark suites than ours.
6. These results are not specific to the selected problem size. They remain equivalent for other problem sizes in both applications.
References
Bautista-Gomez, L., Zyulkyarov, F., Unsal, O., McIntosh-Smith, S.: Unprotected computing: a large-scale study of DRAM raw error rate on a supercomputer. In: IEEE Supercomputing (2016)
Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., Cappello, F.: Lightweight silent data corruption detection based on runtime data analysis for HPC applications. In: ACM HPDC (2015)
Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., Cappello, F.: Toward general software level silent data corruption detection for parallel applications. IEEE TPDS 28(12), 3642–3655 (2017)
Calhoun, J., Olson, L., Snir, M.: FlipIt: an LLVM based fault injector for HPC. In: EuroPar (2014)
Fiala, D., Mueller, F., Engelmann, C., Riesen, R., Ferreira, K., Brightwell, R.: Detection and correction of silent data corruption for large-scale high-performance computing. In: IEEE Supercomputing (2012)
Kuvaiskii, D., Faqeh, R., Bhatotia, P., Felber, P., Fetzer, C.: HAFT: hardware-assisted fault tolerance. In: ACM Eurosys (2016)
Laguna, I., Schulz, M., Richards, D.F., Calhoun, J., Olson, L.: IPAS: intelligent protection against silent output corruption in scientific applications. In: CGO (2016)
Mitropoulou, K., Porpodas, V., Jones, T.M.: COMET: communication-optimised multi-threaded error-detection technique. In: ACM CASES (2016)
Mitropoulou, K., Porpodas, V., Zhang, X., Jones, T.M.: Lynx: using OS and hardware support for fast fine-grained inter-core communication. In: ACM ICS (2016)
Porter, L., et al.: Making the most of SMT in HPC: system- and application-level perspectives. ACM TACO 11(4) (2015)
Reinhardt, S.K., Mukherjee, S.S.: Transient fault detection via simultaneous multithreading. In: IEEE Symposium on Computer Architecture (2000)
Snir, M., et al.: Addressing failures in exascale computing. IJHPCA 28(2), 129–173 (2014)
Sridharan, V., et al.: Memory errors in modern systems: the good, the bad, and the ugly. ACM SIGPLAN Notices 50, 297–310 (2015)
Wang, C., Kim, H.S., Wu, Y., Ying, V.: Compiler-managed software-based redundant multi-threading for transient fault detection. In: ACM CGO (2007)
Yu, J., Garzaran, M.J., Snir, M.: ESoftCheck: removal of non-vital checks for fault tolerance. In: CGO (2009)
Zhang, Y., Lee, J.W., Johnson, N.P., August, D.I.: DAFT: decoupled acyclic fault tolerance. Int. J. Parallel Prog. 40(1), 118–140 (2012)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Pérez, D., Ropars, T., Meneses, E. (2021). On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threading. In: Balis, B., et al. Euro-Par 2020: Parallel Processing Workshops. Euro-Par 2020. Lecture Notes in Computer Science, vol 12480. Springer, Cham. https://doi.org/10.1007/978-3-030-71593-9_23
DOI: https://doi.org/10.1007/978-3-030-71593-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71592-2
Online ISBN: 978-3-030-71593-9
eBook Packages: Computer Science, Computer Science (R0)