DOI: 10.1145/2591971.2592000

The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study

Published: 16 June 2014

Abstract

As DRAM cells continue to shrink, they become more susceptible to retention failures. DRAM cells that permanently exhibit short retention times are fairly easy to identify and repair through the use of memory tests and row and column redundancy. However, the retention time of many cells may vary over time due to a property called Variable Retention Time (VRT). Since these cells intermittently transition between failing and non-failing states, they are particularly difficult to identify through memory tests alone. In addition, the high-temperature packaging process may aggravate this problem, as the susceptibility of cells to VRT increases after the assembly of DRAM chips. A promising alternative to manufacture-time testing is to detect and mitigate retention failures after the system has become operational. Such a system would require mechanisms to detect and mitigate retention failures in the field, but it would be responsive to retention failures introduced after system assembly and could dramatically reduce the cost of testing, enabling much longer tests than are practical with manufacturer testing equipment.
In this paper, we analyze the efficacy of three common error mitigation techniques (memory tests, guardbands, and error correcting codes (ECC)) in real DRAM chips exhibiting both intermittent and permanent retention failures. Our analysis allows us to quantify the efficacy of recent system-level error mitigation mechanisms that build upon these techniques. We revisit prior works in the context of the experimental data we present, showing that our measured results significantly impact these works' conclusions. We find that mitigation techniques that rely on run-time testing alone [38, 27, 50, 26] are unable to ensure reliable operation even after many months of testing. Techniques that incorporate ECC [4, 52], however, can ensure reliable DRAM operation after only a few hours of testing. For example, VS-ECC [4], which couples testing with variable strength codes to allocate the strongest codes to the most error-prone memory regions, can ensure reliable operation for 10 years after only 19 minutes of testing. We conclude that the viability of these mitigation techniques depends on efficient online profiling of DRAM performed without disrupting system operation.
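To make the role of ECC in the abstract concrete, the following minimal sketch estimates how likely a protected word is to hold more failing bits than its code can correct. It is an illustration under stated assumptions, not the paper's model or measured data: bit failures are assumed independent, the 72-bit word size and the per-bit failure probability are hypothetical, and `choose_strength` is a toy stand-in for a variable-strength policy rather than the VS-ECC algorithm of [4].

```python
from math import comb


def uncorrectable_prob(n_bits: int, t: int, p_bit: float) -> float:
    """Probability that more than t of n_bits bits fail.

    Assumes independent bit failures with probability p_bit each
    (an assumption made for illustration only). The binomial tail is
    summed directly to avoid cancellation when p_bit is tiny.
    """
    return sum(
        comb(n_bits, k) * p_bit**k * (1.0 - p_bit) ** (n_bits - k)
        for k in range(t + 1, n_bits + 1)
    )


def choose_strength(known_failures: int, margin: int = 1, max_t: int = 4) -> int:
    """Toy variable-strength policy (hypothetical, not VS-ECC itself).

    Covers the failures already found by testing plus `margin` extra
    bits for intermittent failures that testing missed, capped at the
    strongest code available (`max_t` correctable bits).
    """
    return min(known_failures + margin, max_t)


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-bit failure probability
    for t in (0, 1, 2):
        prob = uncorrectable_prob(72, t, p)
        print(f"t={t}: P(uncorrectable word) = {prob:.3e}")
    for found in (0, 1, 3):
        print(f"{found} known failing bits -> correct up to {choose_strength(found)} bits")
```

Under these assumptions, each additional correctable bit reduces the residual failure probability by several orders of magnitude. This is one way to see why a scheme that pairs a short test with ECC could tolerate intermittently failing cells that a test-only scheme must find in advance.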

References

[1] R. D. Adams. High performance memory testing: Design principles, fault modeling and self-test. Springer, 2003.
[2] J.-H. Ahn et al. Adaptive self refresh scheme for battery operated high-density mobile DRAM applications. ASSCC, 2006.
[3] Z. Al-Ars et al. DRAM-specific space of memory tests. ITC, 2006.
[4] A. R. Alameldeen et al. Energy-efficient cache design using variable-strength error-correcting codes. ISCA, 2011.
[5] R. Baumann. The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction. IEDM, 2002.
[6] K. Chang et al. Improving DRAM performance by parallelizing refreshes with accesses. HPCA, 2014.
[7] M. de Kruijf et al. Relax: An architectural framework for software recovery of hardware faults. ISCA, 2010.
[8] P. G. Emma et al. Rethinking refresh: Increasing availability and reducing power in DRAM for cache applications. IEEE Micro, 28(6), Nov. 2008.
[9] H. Esmaeilzadeh et al. Neural acceleration for general-purpose approximate programs. MICRO, 2012.
[10] D. Frank et al. Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, 89(3), 2001.
[11] T. Hamamoto et al. On the retention time distribution of Dynamic Random Access Memory (DRAM). 1998.
[12] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. TNS, 47(6), 2000.
[13] A. Hiraiwa et al. Local-field-enhancement model of DRAM retention failure. IEDM, 1998.
[14] C.-S. Hou et al. An FPGA-based test platform for analyzing data retention time distribution of DRAMs. VLSI-DAT, 2013.
[15] JEDEC. Standard No. 79-3F. DDR3 SDRAM Specification, July 2012.
[16] S. Khan et al. The efficacy of error mitigation techniques for DRAM retention failures: A comparative experimental study -- Full data sets. http://www.ece.cmu.edu/~safari/tools/dram-sigmetrics2014-fulldata.html.
[17] H. Kim et al. Characterization of the variable retention time in dynamic random access memory. IEEE Trans. Electron Dev., 58(9), 2011.
[18] K. Kim. Technology for sub-50nm DRAM and NAND flash manufacturing. IEDM, 2005.
[19] K. Kim and J. Lee. A new investigation of data retention time in truly nanoscaled DRAMs. IEEE Electron Device Letters, 30(8), 2009.
[20] Y. Kim et al. A case for exploiting subarray-level parallelism (SALP) in DRAM. ISCA, 2012.
[21] Y. I. Kim et al. Thermal degradation of DRAM retention time: Characterization and improving techniques. IRPS, 2004.
[22] D. Lee et al. Tiered-latency DRAM: A low latency and low cost DRAM architecture. HPCA, 2013.
[23] M. J. Lee and K. W. Park. A mechanism for dependence of refresh time on data pattern in DRAM. Electron Device Letters, 31(2), 2010.
[24] X. Li et al. A realistic evaluation of memory hardware errors and software system susceptibility. ATC, 2010.
[25] X. Li and D. Yeung. Application-level correctness and its impact on fault tolerance. HPCA, 2007.
[26] C.-H. Lin et al. SECRET: Selective error correction for refresh energy reduction in DRAMs. ICCD, 2012.
[27] J. Liu et al. RAIDR: Retention-aware intelligent DRAM refresh. ISCA, 2012.
[28] J. Liu et al. An experimental study of data retention behavior in modern DRAM devices: Implications for retention time profiling mechanisms. ISCA, 2013.
[29] S. Liu et al. Flikker: Saving DRAM refresh-power through critical data partitioning. ASPLOS, 2011.
[30] Y. Luo. Characterizing application memory error vulnerability to optimize data center cost. DSN, 2014.
[31] J. A. Mandelman et al. Challenges and future directions for the scaling of dynamic random-access memory (DRAM). IBM J. of Res. and Dev., 2002.
[32] T. C. May et al. Alpha-particle-induced soft errors in dynamic memories. IEEE Trans. Electron Dev., 1979.
[33] Y. Mori et al. The origin of variable retention time in DRAM. IEDM, 2005.
[34] W. Mueller et al. Challenges for the DRAM cell scaling to 40nm. IEDM, 2005.
[35] S. S. Mukherjee et al. The soft error problem: An architectural perspective. HPCA, 2005.
[36] O. Mutlu. Memory scaling: A systems architecture perspective. IMW, 2013.
[37] P. Nair et al. A case for refresh pausing in DRAM memory systems. HPCA, 2012.
[38] P. J. Nair et al. ArchShield: Architectural framework for assisting DRAM scaling by tolerating high error rates. ISCA, 2013.
[39] H.-D. Oberle et al. Enhanced fault modeling for DRAM test and analysis. VTS, 1991.
[40] T. J. O'Gorman. The effect of cosmic rays on the soft error rate of a DRAM at ground level. IEEE Trans. Electron Dev., 41(4), 1994.
[41] P. J. Restle, J. W. Park, and B. F. Lloyd. DRAM variable retention time. IEDM, 1992.
[42] S. E. Schechter, G. H. Loh, et al. Use ECP, not ECC, for hard failures in resistive memories. ISCA, 2010.
[43] B. Schroeder et al. DRAM errors in the wild: A large-scale field study. SIGMETRICS, 2009.
[44] H. W. Seo et al. Charge trapping induced DRAM data retention time degradation under wafer-level burn-in stress. IRPS, 2002.
[45] V. Sridharan et al. Feng Shui of supercomputer memory: Positional effects in DRAM and SRAM faults. SC, 2013.
[46] V. Sridharan and D. Liberty. A study of DRAM failures in the field. SC, 2012.
[47] G. R. Srinivasan et al. Accurate, predictive modeling of soft error rate due to cosmic rays and chip alpha radiation. IRPS, 1994.
[48] A. J. van de Goor et al. An overview of deterministic functional RAM chip testing. ACM Computing Surveys, 1990.
[49] A. J. van de Goor and A. Paalvast. Industrial evaluation of DRAM SIMM tests. ITC, 2000.
[50] R. K. Venkatesan et al. Retention-aware placement in DRAM (RAPID): Software methods for quasi-non-volatile DRAM. HPCA, 2006.
[51] M.-J. Wang et al. Guardband determination for the detection of off-state and junction leakages in DRAM testing. ATS, 2001.
[52] C. Wilkerson et al. Reducing cache power with low-cost, multi-bit error-correcting codes. ISCA, 2010.
[53] Xilinx. ML605 Hardware User Guide, Oct. 2012.
[54] K. Yamaguchi. Theoretical study of deep-trap-assisted anomalous currents in worst-bit cells of dynamic random-access memories (DRAM's). IEEE Trans. Electron Dev., 47(4), 2000.
[55] D. Yaney et al. A meta-stable leakage phenomenon in DRAM charge storage - Variable hold time. IEDM, 1987.
[56] D. H. Yoon and M. Erez. Virtualized and flexible ECC for main memory. ASPLOS, 2010.

      Published In

      SIGMETRICS '14: The 2014 ACM international conference on Measurement and modeling of computer systems
      June 2014
      614 pages
      ISBN:9781450327893
      DOI:10.1145/2591971

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. DRAM
      2. ECC
      3. error correction
      4. fault tolerance
      5. memory scaling
      6. retention failures
      7. system-level detection and mitigation

      Qualifiers

      • Research-article

      Conference

      SIGMETRICS '14
      Acceptance Rates

      SIGMETRICS '14 Paper Acceptance Rate 40 of 237 submissions, 17%;
      Overall Acceptance Rate 459 of 2,691 submissions, 17%

      Cited By

      • (2024) SpyHammer: Understanding and Exploiting RowHammer Under Fine-Grained Temperature Variations. IEEE Access 12 (80986-81003). DOI: 10.1109/ACCESS.2024.3409389. Online publication date: 2024
      • (2023) RowPress: Amplifying Read Disturbance in Modern DRAM Chips. Proceedings of the 50th Annual International Symposium on Computer Architecture (1-18). DOI: 10.1145/3579371.3589063. Online publication date: 17-Jun-2023
      • (2023) DRAM Bender: An Extensible and Versatile FPGA-Based Infrastructure to Easily Test State-of-the-Art DRAM Chips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:12 (5098-5112). DOI: 10.1109/TCAD.2023.3282172. Online publication date: Dec-2023
      • (2023) Extending Memory Capacity in Modern Consumer Systems With Emerging Non-Volatile Memory: Experimental Analysis and Characterization Using the Intel Optane SSD. IEEE Access 11 (105843-105871). DOI: 10.1109/ACCESS.2023.3317884. Online publication date: 2023
      • (2023) Voltage Reduced Self Refresh (VRSR) for optimized energy savings in DRAM Memories. Memories - Materials, Devices, Circuits and Systems (100058). DOI: 10.1016/j.memori.2023.100058. Online publication date: May-2023
      • (2022) HiRA: Hidden Row Activation for Reducing Refresh Latency of Off-the-Shelf DRAM Chips. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture (815-834). DOI: 10.1109/MICRO56248.2022.00062. Online publication date: 1-Oct-2022
      • (2022) Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access 10 (52565-52608). DOI: 10.1109/ACCESS.2022.3174101. Online publication date: 2022
      • (2022) A Modern Primer on Processing in Memory. Emerging Computing: From Devices to Systems (171-243). DOI: 10.1007/978-981-16-7487-7_7. Online publication date: 9-Jul-2022
      • (2021) Uncovering In-DRAM RowHammer Protection Mechanisms: A New Methodology, Custom RowHammer Patterns, and Implications. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (1198-1213). DOI: 10.1145/3466752.3480110. Online publication date: 18-Oct-2021
      • (2021) A Deeper Look into RowHammer’s Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (1182-1197). DOI: 10.1145/3466752.3480069. Online publication date: 18-Oct-2021
