DOI: 10.1145/2594291.2594298
research-article

Compiler-assisted detection of transient memory errors

Published: 09 June 2014

Abstract

The probability of bit flips in hardware memory systems is projected to increase significantly as memory systems continue to scale in size and complexity. Effective hardware-based error detection and correction require that the complete data path, involving all parts of the memory system, be protected with sufficient redundancy. First, this may be costly to employ on commodity computing platforms, and second, even on high-end systems, protection against multi-bit errors may be lacking. Therefore, augmenting hardware error detection schemes with software techniques is of considerable interest.
In this paper, we consider software-level mechanisms to comprehensively detect transient memory faults. We develop novel compile-time algorithms to instrument application programs with checksum computation codes to detect memory errors. Unlike prior approaches that employ checksums on computational and architectural states, our scheme verifies every data access and works by tracking variables as they are produced and consumed. Experimental evaluation demonstrates that the proposed comprehensive error detection solution is viable as a completely software-only scheme. We also demonstrate that with limited hardware support, overheads of error detection can be further reduced.
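The abstract does not spell out how the checksums are constructed. The following is a minimal sketch of one way def/use checksum tracking could work, assuming the number of reads of each produced value is known when the value is written (in a compiler setting this would come from static analysis). The class `ChecksumTracker`, the function `prefix_sums`, and the `fault` parameter are hypothetical names introduced only for illustration; they are not taken from the paper.

```python
MASK = (1 << 64) - 1  # keep both checksums in a fixed 64-bit width


class ChecksumTracker:
    """Hypothetical def/use checksum pair (illustrative, not the paper's code)."""

    def __init__(self):
        self.def_sum = 0  # accumulated when values are produced
        self.use_sum = 0  # accumulated when values are consumed

    def on_def(self, value, expected_uses):
        # A value that will be read k times contributes k * value.
        self.def_sum = (self.def_sum + value * expected_uses) & MASK

    def on_use(self, value):
        # Each read contributes the value actually observed in memory.
        self.use_sum = (self.use_sum + value) & MASK

    def consistent(self):
        # Without memory corruption, both sums cover the same values.
        return self.def_sum == self.use_sum


def prefix_sums(a, tracker, fault=None):
    """Instrumented b[i] = b[i-1] + a[i].

    `fault`, if given, is (index, bit): that bit of b[index] is flipped
    right after the store to simulate a transient memory error.
    """
    n = len(a)
    for x in a:
        tracker.on_def(x, 1)  # assumption: each input element is read once
    b = [0] * n
    for i in range(n):
        tracker.on_use(a[i])
        prev = 0
        if i > 0:
            prev = b[i - 1]
            tracker.on_use(prev)
        b[i] = prev + a[i]
        # b[i] is read once by iteration i + 1; the last element is never read.
        tracker.on_def(b[i], 1 if i < n - 1 else 0)
        if fault is not None and fault[0] == i:
            b[i] ^= 1 << fault[1]  # simulated bit flip while the value sits in memory
    return b


if __name__ == "__main__":
    clean = ChecksumTracker()
    prefix_sums([3, 1, 4, 1, 5], clean)
    print("clean run consistent:", clean.consistent())

    faulty = ChecksumTracker()
    prefix_sums([3, 1, 4, 1, 5], faulty, fault=(1, 3))
    print("faulty run consistent:", faulty.consistent())
```

In this sketch a flipped bit is caught only if the corrupted value is read afterwards, since detection relies on the use-side sum diverging from the def-side sum. A single additive checksum can also miss errors that happen to cancel each other out.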



    Published In

    PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2014
    619 pages
    ISBN: 9781450327848
    DOI: 10.1145/2594291

    ACM SIGPLAN Notices, Volume 49, Issue 6 (PLDI '14)
    June 2014
    598 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2666356
    Editor: Andy Gill
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 June 2014

    Author Tags

    1. checksums
    2. def-use tracking
    3. transient memory errors

    Qualifiers

    • Research-article

    Conference

    PLDI '14
    Acceptance Rates

    PLDI '14 Paper Acceptance Rate 52 of 287 submissions, 18%;
    Overall Acceptance Rate 406 of 2,067 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 37
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 05 Mar 2025.

    Citations

    Cited By

    • (2023) Automatic Algorithm-Based Fault Tolerance (AABFT) of Stencil Computations. 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 187-198. DOI: 10.1109/PACT58117.2023.00024. Online publication date: 21-Oct-2023
    • (2019) FailAmp. ACM Transactions on Architecture and Code Optimization 16:4, pp. 1-21. DOI: 10.1145/3369381. Online publication date: 18-Dec-2019
    • (2017) Generic Soft-Error Detection and Correction for Concurrent Data Structures. IEEE Transactions on Dependable and Secure Computing 14:1, pp. 22-36. DOI: 10.1109/TDSC.2015.2427832. Online publication date: 1-Jan-2017
    • (2017) A Gaussian Process Approach for Effective Soft Error Detection. 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 608-612. DOI: 10.1109/CLUSTER.2017.129. Online publication date: Sep-2017
    • (2017) MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection. 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 717-724. DOI: 10.1109/CLUSTER.2017.128. Online publication date: Sep-2017
    • (2016) New-Sum. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 43-55. DOI: 10.1145/2907294.2907306. Online publication date: 31-May-2016
    • (2016) A Survey of Techniques for Modeling and Improving Reliability of Computing Systems. IEEE Transactions on Parallel and Distributed Systems 27:4, pp. 1226-1238. DOI: 10.1109/TPDS.2015.2426179. Online publication date: 1-Apr-2016
    • (2015) Impact of Loop Transformations on Software Reliability. Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 278-285. DOI: 10.5555/2840819.2840859. Online publication date: 2-Nov-2015
    • (2015) Impact of loop transformations on software reliability. 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 278-285. DOI: 10.1109/ICCAD.2015.7372581. Online publication date: Nov-2015
