DOI: 10.1145/3351556.3351567
short-paper

Bi-Source Verification Against Silent Data Corruption in High Performance Computing

Published: 26 September 2019

ABSTRACT

This paper proposes a continuous health-check approach for detecting Silent Data Corruption (SDC) in High Performance Computing (HPC) environments. The goal is to minimize the effect of hardware errors on the overall reliability and accuracy of the system by overseeing and validating the accuracy of data. Our work compares two approaches to overcoming SDC and presents their advantages and shortcomings. Our research shows that, of the two proposed methods - threshold-triggered and continuous verification - the latter is superior in terms of latency.
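
The abstract does not describe how either method is implemented. The sketch below is only an illustration of why checking more often might shorten detection latency. It assumes that "bi-source" means comparing two copies of the same data block by block, that a CRC32 digest is an acceptable comparison, and that "threshold-triggered" means verifying once a fixed number of updates has accumulated. The names `Verifier`, `find_mismatches`, and `detection_step` are hypothetical and do not come from the paper.

```python
import zlib
from dataclasses import dataclass, field


def checksum(block: bytes) -> int:
    """CRC32 of a data block (a stand-in for any cheap digest)."""
    return zlib.crc32(block)


def find_mismatches(primary: list[bytes], replica: list[bytes]) -> list[int]:
    """Return the indices of blocks whose two copies disagree."""
    return [i for i, (a, b) in enumerate(zip(primary, replica))
            if checksum(a) != checksum(b)]


@dataclass
class Verifier:
    # Assumption: threshold=1 models continuous verification (compare
    # after every update); a larger value models threshold-triggered
    # verification (compare once that many updates have accumulated).
    threshold: int = 1
    pending: int = 0
    detected: list[tuple[int, list[int]]] = field(default_factory=list)

    def on_update(self, step: int, primary: list[bytes],
                  replica: list[bytes]) -> None:
        self.pending += 1
        if self.pending >= self.threshold:
            self.pending = 0
            bad = find_mismatches(primary, replica)
            if bad:
                self.detected.append((step, bad))


def detection_step(threshold: int, corrupt_at: int = 3, steps: int = 10):
    """Inject one bit flip into the replica at `corrupt_at` and return
    the step at which the verifier first reports it (None if never)."""
    primary = [bytes([i]) * 8 for i in range(4)]
    replica = list(primary)
    verifier = Verifier(threshold=threshold)
    for step in range(1, steps + 1):
        if step == corrupt_at:
            flipped = bytearray(replica[2])
            flipped[0] ^= 0x01  # simulated silent bit flip
            replica[2] = bytes(flipped)
        verifier.on_update(step, primary, replica)
    return verifier.detected[0][0] if verifier.detected else None


if __name__ == "__main__":
    print("continuous:", detection_step(threshold=1))
    print("threshold-triggered:", detection_step(threshold=5))
```

Under these assumptions, the continuous verifier flags the injected bit flip at the step where it occurs, while the threshold-triggered verifier only notices it at its next scheduled check. The price is one comparison per update instead of one per batch, a trade-off this toy model does not measure.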


      • Published in

        BCI'19: Proceedings of the 9th Balkan Conference on Informatics
        September 2019
        225 pages

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • short-paper
        • Research
        • Refereed limited

        Acceptance Rates

        BCI'19 Paper Acceptance Rate: 24 of 73 submissions, 33%
        Overall Acceptance Rate: 97 of 250 submissions, 39%