DOI: 10.1145/2555243.2555279
poster

Detecting silent data corruption through data dynamic monitoring for scientific applications

Published: 06 February 2014

Abstract

Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper, we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate the technique with synthetic benchmarks and show that it can detect up to 50% of injected errors while incurring only negligible overhead.
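
The abstract does not describe the detector itself, so the following is only a minimal sketch of the general idea of learning a dataset's normal dynamics and flagging departures from them. It is not the authors' method. It assumes that "dynamics" means the step-to-step change of each data point, and every name in it (`DynamicsMonitor`, `flip_bit`, `snapshot`, `run_demo`, the tolerance factor and warm-up length) is hypothetical.

```python
import math
import struct


def flip_bit(x, bit):
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of a 64-bit float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


class DynamicsMonitor:
    """Hypothetical detector: learns the largest step-to-step change seen in a
    dataset, then flags later changes that exceed it by a tolerance factor."""

    def __init__(self, tolerance=2.0, warmup_steps=3):
        self.tolerance = tolerance        # assumed safety factor on the learned bound
        self.warmup_steps = warmup_steps  # snapshots used only for learning
        self.previous = None
        self.max_delta = 0.0
        self.steps_seen = 0

    def check(self, values):
        """Return the indices of points whose change since the last snapshot
        lies outside the learned normal range."""
        suspects = []
        if self.previous is not None:
            deltas = [abs(v - p) for v, p in zip(values, self.previous)]
            if self.steps_seen > self.warmup_steps:
                limit = self.tolerance * self.max_delta
                # "not d <= limit" also flags NaN, which fails every comparison.
                suspects = [i for i, d in enumerate(deltas) if not d <= limit]
            # Keep learning, but only from changes that were not flagged.
            clean = [d for i, d in enumerate(deltas) if i not in suspects]
            if clean:
                self.max_delta = max(self.max_delta, max(clean))
        self.previous = list(values)
        self.steps_seen += 1
        return suspects


def snapshot(t, size=16):
    """Synthetic, smoothly evolving dataset standing in for simulation state."""
    return [math.sin(0.1 * i + 0.05 * t) for i in range(size)]


def run_demo(bit, point=7, clean_steps=10):
    """Feed clean snapshots, inject one bit flip, and return what is flagged."""
    monitor = DynamicsMonitor()
    for t in range(clean_steps):
        monitor.check(snapshot(t))
    corrupted = snapshot(clean_steps)
    corrupted[point] = flip_bit(corrupted[point], bit)
    return monitor.check(corrupted)
```

Calling `run_demo(62)` injects a flip in a high exponent bit, which changes the value by many orders of magnitude, while `run_demo(0)` flips the lowest mantissa bit, which changes the value by far less than its normal step-to-step variation. A range-based check of this kind can only see the first sort of error, which is one reason a monitor like this would be expected to catch some injected errors and miss others.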


    Published In

    PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
    February 2014
    412 pages
    ISBN: 9781450326568
    DOI: 10.1145/2555243
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bit flips
    2. data entropy
    3. fault tolerance
    4. silent data corruption
    5. soft errors
    6. supercomputers

    Qualifiers

    • Poster

    Conference

    PPoPP '14

    Acceptance Rates

    PPoPP '14 Paper Acceptance Rate 28 of 184 submissions, 15%;
    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Cited By

    • (2021) Resilient Scheduling Heuristics for Rigid Parallel Jobs. International Journal of Networking and Computing 11:1 (2-26). DOI: 10.15803/ijnc.11.1_2. Online publication date: 2021
    • (2020) Tracking scientific simulation using online time-series modelling. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) (202-211). DOI: 10.1109/CCGrid49817.2020.00-73. Online publication date: May-2020
    • (2019) Algorithm-Based Fault Tolerance for Parallel Stencil Computations. 2019 IEEE International Conference on Cluster Computing (CLUSTER) (1-11). DOI: 10.1109/CLUSTER.2019.8891034. Online publication date: Sep-2019
    • (2019) Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) (31-40). DOI: 10.1109/CCGRID.2019.00013. Online publication date: May-2019
    • (2018) Improving data integrity in Linux software RAID with protection information (T10-PI). Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (609-615). DOI: 10.1109/CCGRID.2018.00091. Online publication date: 1-May-2018
    • (2017) Toward General Software Level Silent Data Corruption Detection for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems 28:12 (3642-3655). DOI: 10.1109/TPDS.2017.2735971. Online publication date: 1-Dec-2017
    • (2017) Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis. High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation (158-178). DOI: 10.1007/978-3-319-72971-8_8. Online publication date: 23-Dec-2017
    • (2016) Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications. Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833 (419-430). DOI: 10.1007/978-3-319-43659-3_31. Online publication date: 24-Aug-2016
    • (2015) Detecting Silent Data Corruption for Extreme-Scale MPI Applications. Proceedings of the 22nd European MPI Users' Group Meeting (1-10). DOI: 10.1145/2802658.2802665. Online publication date: 21-Sep-2015
    • (2015) Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications. Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (275-278). DOI: 10.1145/2749246.2749253. Online publication date: 15-Jun-2015
