DOI: 10.1145/3123939.3123945

Detecting and mitigating data-dependent DRAM failures by exploiting current memory content

Published: 14 October 2017

Abstract

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating failures in the failing cells with higher refresh rates or error-correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge.
In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates only the failures that occur with the current content in memory while programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. Because detecting failures through runtime testing has a high overhead, MEMCON selectively initiates a test on a write only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide a significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle.
Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65–74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
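
The Pareto observation in the abstract can be made concrete with a short sketch. For a Pareto-distributed write interval with scale x_m and shape alpha, the probability of staying idle for an additional period, given the time already spent idle, rises as the elapsed idle time grows, and for alpha > 1 the expected remaining idle time is elapsed / (alpha - 1). The Python below illustrates only this statistical property. The function names, the parameter values, and the threshold rule in `should_test` are assumptions made for illustration and are not taken from the paper; they should not be read as MEMCON's actual predictor.

```python
def pareto_survival(x, x_m, alpha):
    """P(X > x) for a Pareto(x_m, alpha) distributed write interval X."""
    if x < x_m:
        return 1.0
    return (x_m / x) ** alpha


def prob_idle_at_least(elapsed, extra, x_m, alpha):
    """P(X > elapsed + extra | X > elapsed): chance the page stays idle
    for `extra` more time units after already being idle for `elapsed`."""
    return (pareto_survival(elapsed + extra, x_m, alpha)
            / pareto_survival(elapsed, x_m, alpha))


def expected_remaining_idle(elapsed, x_m, alpha):
    """E[X - elapsed | X > elapsed]; this closed form holds for
    elapsed >= x_m and is finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("mean residual life is infinite for alpha <= 1")
    if elapsed < x_m:
        raise ValueError("closed form assumes elapsed >= x_m")
    return elapsed / (alpha - 1)


def should_test(elapsed, min_benefit, x_m, alpha, confidence):
    """Hypothetical decision rule (an assumption, not the paper's method):
    start a test only if the page is likely to stay idle for at least
    `min_benefit` more time units."""
    return prob_idle_at_least(elapsed, min_benefit, x_m, alpha) >= confidence


if __name__ == "__main__":
    # Hypothetical parameters: x_m = 1 time unit, alpha = 1.5.
    for elapsed in (1.0, 2.0, 4.0, 10.0):
        p = prob_idle_at_least(elapsed, 1.0, 1.0, 1.5)
        rem = expected_remaining_idle(elapsed, 1.0, 1.5)
        print(f"idle {elapsed:5.1f}: P(1 more unit) = {p:.3f}, "
              f"expected remaining idle = {rem:.1f}")
```

Under these assumed parameters, both the conditional probability and the expected remaining idle time increase with the elapsed idle time, which is the property that makes waiting before testing a reasonable heuristic.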


Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM
  2. data-dependent failures
  3. energy
  4. fault tolerance
  5. memory systems
  6. performance
  7. refresh
  8. reliability
  9. retention failures
  10. system-level failure detection and mitigation

Qualifiers

  • Research-article

Funding Sources

  • Semiconductor Research Corporation
  • Intel Science and Technology Center for Cloud Computing
  • NSF

Conference

MICRO-50

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2024) Spatial Variation-Aware Read Disturbance Defenses: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 560-577. DOI: 10.1109/HPCA57654.2024.00048. Online publication date: 2-Mar-2024
  • (2024) Read Disturbance in High Bandwidth Memory: A Detailed Experimental Study on HBM2 DRAM Chips. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 75-89. DOI: 10.1109/DSN58291.2024.00022. Online publication date: 24-Jun-2024
  • (2023) RowPress: Amplifying Read Disturbance in Modern DRAM Chips. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-18. DOI: 10.1145/3579371.3589063. Online publication date: 17-Jun-2023
  • (2023) DRAM Bender: An Extensible and Versatile FPGA-Based Infrastructure to Easily Test State-of-the-Art DRAM Chips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42:12, pp. 5098-5112. DOI: 10.1109/TCAD.2023.3282172. Online publication date: Dec-2023
  • (2023) Extending Memory Capacity in Modern Consumer Systems With Emerging Non-Volatile Memory: Experimental Analysis and Characterization Using the Intel Optane SSD. IEEE Access, 11, pp. 105843-105871. DOI: 10.1109/ACCESS.2023.3317884. Online publication date: 2023
  • (2022) Hybrid Refresh: Improving DRAM Performance by Handling Weak Rows Smartly. Proceedings of the 2022 International Symposium on Memory Systems, pp. 1-11. DOI: 10.1145/3565053.3565060. Online publication date: 3-Oct-2022
  • (2022) PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM. ACM Transactions on Architecture and Code Optimization, 20:1, pp. 1-31. DOI: 10.1145/3563697. Online publication date: 17-Nov-2022
  • (2022) MOESI-prime. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 670-684. DOI: 10.1145/3470496.3527427. Online publication date: 18-Jun-2022
  • (2022) Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access, 10, pp. 52565-52608. DOI: 10.1109/ACCESS.2022.3174101. Online publication date: 2022
  • (2022) A Modern Primer on Processing in Memory. Emerging Computing: From Devices to Systems, pp. 171-243. DOI: 10.1007/978-981-16-7487-7_7. Online publication date: 9-Jul-2022
