Improving DRAM Performance, Reliability, and Security by Rigorously Understanding Intrinsic DRAM Operation

Open access
Date
2022
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
DRAM is the primary technology used for main memory in modern systems. Unfortunately, as DRAM scales down to smaller technology nodes, it faces key challenges in both data integrity and latency, which strongly affect overall system reliability, security, and performance. To develop reliable, secure, and high-performance DRAM-based main memory for future systems, it is critical to rigorously characterize, analyze, and understand various aspects (e.g., reliability, retention, latency, RowHammer vulnerability) of existing DRAM chips and their architecture. The goal of this dissertation is to 1) develop techniques and infrastructures to enable such rigorous characterization, analysis, and understanding, and 2) enable new mechanisms to improve DRAM performance, reliability, and security based on the developed understanding.
To this end, in this dissertation, we 1) design, implement, and prototype a new practical-to-use and flexible FPGA-based DRAM characterization infrastructure (called SoftMC), 2) use the DRAM characterization infrastructure to develop a new experimental methodology (called U-TRR) to uncover the operation of existing proprietary in-DRAM RowHammer protection mechanisms and craft new RowHammer access patterns to efficiently circumvent these RowHammer protection mechanisms, 3) propose a new DRAM architecture, called Self-Managing DRAM, for enabling autonomous and efficient in-DRAM maintenance operations that enable not only better performance, efficiency, and reliability but also faster and easier adoption of changes to DRAM chips, and 4) propose a versatile DRAM substrate, called the Copy-Row (CROW) substrate, that enables new mechanisms for improving DRAM performance, energy consumption, and reliability.
SoftMC. To develop reliable and high-performance DRAM-based main memory in future systems, it is critical to experimentally characterize, understand, and analyze various aspects (e.g., reliability, latency) of existing DRAM chips. To enable this, there is a strong need for a publicly available DRAM testing infrastructure that can flexibly and efficiently test DRAM chips in a manner accessible to both software and hardware developers. To this end, we design and prototype SoftMC: a flexible and practical FPGA-based DRAM testing infrastructure. SoftMC implements all low-level DRAM operations (i.e., DDR commands) available in a typical memory controller (e.g., opening a row in a bank, reading a specific column address, performing a refresh operation, enforcing various timing constraints between commands). Using these low-level operations, SoftMC can test and characterize any (existing or new) DRAM mechanism that uses the existing DDR interface. SoftMC provides its users with a simple and intuitive high-level programming interface that completely hides the low-level details of the FPGA. SoftMC is freely available as an open-source tool, and it has enabled many research projects since its release, leading to new understanding and new techniques.
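To make "enforcing various timing constraints between commands" concrete, the following is a minimal Python sketch of how a test program might place DDR commands for a single bank on a cycle timeline. It is not SoftMC's programming interface: the `TIMING` table, its cycle values, and the `schedule` function are hypothetical names and illustrative numbers chosen for this example only.

```python
from dataclasses import dataclass

# Minimum gaps between command pairs, in controller clock cycles.
# ASSUMPTION: illustrative values only, not taken from any DRAM datasheet.
TIMING = {
    ("ACT", "RD"): 11,   # row-to-column delay (tRCD)
    ("ACT", "PRE"): 28,  # minimum row-open time (tRAS)
    ("RD", "PRE"): 6,    # read-to-precharge delay (tRTP)
    ("PRE", "ACT"): 11,  # precharge time (tRP)
}


@dataclass
class Issued:
    cmd: str
    cycle: int


def schedule(commands):
    """Give each command the earliest cycle that respects every listed
    gap to all previously issued commands (single bank, in order)."""
    issued = []
    for cmd in commands:
        # Commands are issued in order, at most one per cycle.
        earliest = issued[-1].cycle + 1 if issued else 0
        for prev in issued:
            gap = TIMING.get((prev.cmd, cmd))
            if gap is not None:
                earliest = max(earliest, prev.cycle + gap)
        issued.append(Issued(cmd, earliest))
    return issued


if __name__ == "__main__":
    for item in schedule(["ACT", "RD", "PRE", "ACT"]):
        print(item.cycle, item.cmd)
```

Shrinking one entry of the table below its nominal value is the kind of experiment such an infrastructure enables, for example to observe how a chip behaves when a row is read before the nominal row-to-column delay has elapsed.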
U-TRR. RowHammer is a critical vulnerability in modern DRAM chips that can lead to reliability, safety, and security problems in computing systems. As such, DRAM vendors have been implementing techniques to protect DRAM chips against RowHammer. We challenge the claim of DRAM vendors that their DRAM chips are completely protected against RowHammer using proprietary, undocumented, and obscure on-die Target Row Refresh (TRR) mechanisms. To assess the security guarantees of recent DRAM chips, we develop Uncovering TRR (U-TRR), a new experimental methodology to analyze in-DRAM TRR implementations. U-TRR is based on the new observation that data retention failures in DRAM enable a side channel that leaks information on how TRR refreshes potential victim rows. U-TRR allows us to (i) understand how logical DRAM rows are laid out physically in silicon; (ii) study undocumented on-die TRR mechanisms; and (iii) combine (i) and (ii) to evaluate the RowHammer security guarantees of modern DRAM chips. We show how U-TRR allows us to craft RowHammer access patterns that successfully circumvent the TRR mechanisms employed in 45 DRAM modules of the three major DRAM vendors. We find that the DRAM modules we analyze are vulnerable to RowHammer, having bit flips in up to 99.9% of all DRAM rows and that simple error-correcting codes cannot prevent bit flips. As such, more robust techniques to protect against the RowHammer vulnerability are necessary. We publicly release the source code of our implementation of the U-TRR methodology.
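The retention side channel can be illustrated with a toy simulation. The sketch below assumes a row whose retention time is already known; it writes the row, lets a candidate action run partway through the wait, and then checks whether the data survived past the retention time. If it did, something must have refreshed the row in between. `SimulatedRow`, `row_was_refreshed`, and the 0.75 wait factor are hypothetical choices for this example and are not the released U-TRR code.

```python
class SimulatedRow:
    """Toy DRAM row that loses its data once more than `retention_ms`
    passes without the row being written or refreshed."""

    def __init__(self, retention_ms):
        self.retention_ms = retention_ms
        self.last_restore_ms = 0.0
        self.corrupted = False

    def _decay(self, now_ms):
        if now_ms - self.last_restore_ms > self.retention_ms:
            self.corrupted = True

    def write(self, now_ms):
        self.corrupted = False
        self.last_restore_ms = now_ms

    def refresh(self, now_ms):
        # A refresh restores charge but cannot recover already-lost data.
        self._decay(now_ms)
        self.last_restore_ms = now_ms

    def read_ok(self, now_ms):
        self._decay(now_ms)
        return not self.corrupted


def row_was_refreshed(row, action, start_ms=0.0):
    """Infer whether `action` refreshed `row`.

    ASSUMPTION: waiting 0.75x the retention time before and after the
    action is an arbitrary choice; the total wait (1.5x) exceeds the
    retention time, while each half on its own does not.
    """
    wait = 0.75 * row.retention_ms
    row.write(start_ms)
    action(row, start_ms + wait)          # e.g. hammer neighbours, issue REF
    return row.read_ok(start_ms + 2 * wait)


if __name__ == "__main__":
    def trr_fires(row, now_ms):
        row.refresh(now_ms)

    def nothing(row, now_ms):
        pass

    print(row_was_refreshed(SimulatedRow(100.0), trr_fires))  # refreshed
    print(row_was_refreshed(SimulatedRow(100.0), nothing))    # not refreshed
```

On real hardware the action would be a sequence of activations to neighbouring rows followed by refresh commands, and the surviving or failing data reveals whether the protection mechanism chose this row as a victim to refresh.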
Self-Managing DRAM. To ensure reliable and secure DRAM operation, three types of maintenance operations are typically required: 1) DRAM refresh, 2) RowHammer protection, and 3) memory scrubbing. The reliability and security of DRAM chips continuously worsen as the DRAM technology node scales to smaller sizes. Consequently, new DRAM chip generations necessitate making existing maintenance operations more aggressive (e.g., lowering the refresh period) and introducing new types of maintenance operations (e.g., targeted refresh for mitigating RowHammer) while keeping the overheads of maintenance operations minimal. Unfortunately, modifying the existing DRAM maintenance operations is difficult due to the current rigid DRAM interface that places the memory controller completely in charge of DRAM control. Implementing new or modifying existing maintenance operations often requires difficult-to-realize changes in the DRAM interface, the memory controller, and potentially other system components (e.g., system software). Our goal is to 1) ease, and thus accelerate, the process of implementing new DRAM maintenance operations and 2) enable more efficient in-DRAM maintenance operations. To this end, we propose Self-Managing DRAM (SMD), a new low-cost DRAM architecture that enables implementing new in-DRAM maintenance mechanisms with no further changes in the DRAM interface, memory controller, or other system components. We use SMD to implement six maintenance mechanisms for three use cases: 1) DRAM refresh, 2) RowHammer protection, and 3) memory scrubbing. Our evaluations show that SMD-based maintenance operations have significantly lower system performance and energy overheads compared to conventional DDR4 DRAM. A combination of SMD-based maintenance mechanisms that perform refresh, RowHammer protection, and memory scrubbing achieves significant speedup and lower DRAM energy across a wide variety of system configurations. SMD’s benefits increase as DRAM chips become denser. We publicly release all SMD source code and data.
CROW. Three major challenges to DRAM scaling (i.e., high access latencies, high refresh overheads, and increasing reliability problems like RowHammer) are difficult to solve efficiently by directly modifying the underlying cell array structure. This is because commodity DRAM implements an extremely dense DRAM cell array that is optimized for low area-per-bit. Because of its density, even a small change in the DRAM cell array structure may incur non-negligible area overhead. Thus, we would like to lower the DRAM access latency, reduce the refresh overhead, and improve DRAM reliability with no changes to the DRAM cell architecture, and with only minimal changes to the DRAM chip. To this end, we propose Copy-Row DRAM (CROW), a flexible substrate that enables new mechanisms for improving DRAM performance, energy efficiency, and reliability. We use the CROW substrate to implement 1) a low-cost in-DRAM caching mechanism that lowers DRAM activation latency to frequently accessed rows by 38% and 2) a mechanism that avoids the use of short-retention-time rows to mitigate the performance and energy overhead of DRAM refresh operations. CROW’s flexibility allows the implementation of both mechanisms at the same time. Our evaluations show that the two CROW-based mechanisms synergistically improve system performance by 20.0% and reduce DRAM energy by 22.3% for memory-intensive four-core workloads. We publicly release the source code of CROW.
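As a rough illustration of the caching idea, the sketch below models a small table that records which regular rows currently have a duplicate in a copy row, so that a later activation of the same row can be served with lower latency. The abstract does not specify how copy rows are allocated or replaced; the `CopyRowTable` class and its least-recently-used replacement policy are assumptions made for this example.

```python
from collections import OrderedDict


class CopyRowTable:
    """Toy tracker for one group of copy rows.

    ASSUMPTION: LRU replacement and this table layout are hypothetical;
    they only illustrate caching frequently accessed rows in copy rows.
    """

    def __init__(self, num_copy_rows):
        self.num_copy_rows = num_copy_rows
        self._rows = OrderedDict()  # regular row -> copy row index, LRU first

    def activate(self, row):
        """Return True if `row` already has a copy (fast activation).
        Otherwise allocate a copy row for it and return False."""
        if row in self._rows:
            self._rows.move_to_end(row)  # mark as most recently used
            return True
        if len(self._rows) >= self.num_copy_rows:
            # Reuse the copy row of the least recently used regular row.
            _, copy_idx = self._rows.popitem(last=False)
        else:
            copy_idx = len(self._rows)
        self._rows[row] = copy_idx
        return False

    def copy_index(self, row):
        """Copy row currently holding `row`, or None if it has no copy."""
        return self._rows.get(row)


if __name__ == "__main__":
    table = CopyRowTable(num_copy_rows=2)
    print([table.activate(r) for r in [5, 9, 5, 7, 9]])
```

The same bookkeeping could serve the second mechanism by pinning rows with short retention times to copy rows instead of replacing them, which is one way a single substrate can support both uses.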
Holistically, via these four major contributions, this dissertation shows the importance of rigorously characterizing the reliability, latency, and RowHammer vulnerability of existing DRAM chips and understanding their architecture for developing practical and low-overhead mechanisms for efficiently improving DRAM reliability, security, and performance. We believe and hope that, via the new infrastructure, understanding, and techniques we develop and enable, this dissertation encourages similar experimental understanding-driven innovation in the design of future memories.
Permanent link
https://doi.org/10.3929/ethz-b-000601914
Publication status
published
Contributors
Examiner: Mutlu, Onur
Examiner: Chiou, Derek
Examiner: Erez, Mattan
Examiner: O'Connor, Mike
Examiner: Qureshi, Moinuddin
Examiner: Weis, Christian
Publisher
ETH Zurich
Subject
Memory systems; DRAM; RowHammer
Organisational unit
09483 - Mutlu, Onur / Mutlu, Onur