Architecting HBM as a high bandwidth, high capacity, self-managed last-level cache

ABSTRACT
The growing number of on-chip cores in today's multi-core processors has increased the demand for memory bandwidth and capacity, but off-chip DRAM is not scaling at the rate this growth requires. Stacked DRAM last-level caches have been proposed to alleviate the bandwidth constraint; however, many of these designs are impractical for real systems or do not exploit the features available in today's stacked DRAM variants.
In this paper, we design a last-level stacked DRAM cache that is practical for real-world systems and takes advantage of High Bandwidth Memory (HBM) [1]. Our HBM cache requires only one minor change to existing memory controllers to support communication, and it uses HBM's built-in logic die to handle tag storage and lookups. We also introduce a novel tag/data storage organization that enables faster lookups, higher associativity, and more capacity than previous designs.
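The abstract does not describe the tag organization in detail. As a rough, hypothetical illustration of the kind of tag check a set-associative stacked-DRAM cache's logic die would perform on each access, consider this minimal sketch; all names and parameters (line size, associativity, set count) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of the tag side of a set-associative DRAM cache.
# All parameters are hypothetical; in the design described above, the
# tag storage and lookup logic would reside on the HBM logic die.

LINE_BYTES = 64        # cache line size in bytes (assumed)
WAYS = 4               # associativity (assumed)
SETS = 1 << 15         # number of sets (assumed)

def split_address(addr: int):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = addr % LINE_BYTES
    line = addr // LINE_BYTES
    return line // SETS, line % SETS, offset

class TagStore:
    """Tag array consulted before touching cache data: a hit avoids a
    trip to off-chip DRAM, a miss triggers a fill."""
    def __init__(self):
        self.sets = [set() for _ in range(SETS)]   # tags present per set

    def lookup(self, addr: int) -> bool:
        tag, set_idx, _ = split_address(addr)
        return tag in self.sets[set_idx]

    def fill(self, addr: int) -> None:
        tag, set_idx, _ = split_address(addr)
        tags = self.sets[set_idx]
        if tag not in tags and len(tags) >= WAYS:
            tags.pop()                 # placeholder (arbitrary) eviction
        tags.add(tag)

cache = TagStore()
cache.fill(0x1234_5678)
assert cache.lookup(0x1234_5678)       # same line: hit
assert not cache.lookup(0x9999_0000)   # different line: miss
```

The real design must also decide where these tags physically live; co-locating tag and data in the same DRAM row (as several prior DRAM-cache proposals do) lets one row activation serve both the tag check and the data read.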
REFERENCES

- [1] JEDEC Standard, "High Bandwidth Memory (HBM) DRAM," JESD235A, 2015.
- [2] M. K. Qureshi and G. H. Loh, "Fundamental latency tradeoff in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design," International Symposium on Microarchitecture (MICRO), 2012, pp. 235--246.
- [3] D. Milojevic, S. Idgunji, D. Jevdjic, E. Ozer, P. Lotfi-Kamran, A. Panteli, A. Prodromou, C. Nicopoulos, D. Hardy, B. Falsafi et al., "Thermal characterization of cloud workloads on a power-efficient server-on-chip," International Conference on Computer Design (ICCD), 2012, pp. 175--182.
- [4] M. R. Meswani, S. Blagodurov, D. Roberts, J. Slice, M. Ignatowski, and G. Loh, "Heterogeneous Memory Architectures: A HW/SW Approach for Mixing Die-stacked and Off-package Memories," International Symposium on High Performance Computer Architecture (HPCA), 2015.
- [5] S. Mittal and J. S. Vetter, "A Survey of Techniques for Architecting DRAM Caches," IEEE Transactions on Parallel and Distributed Systems, 2015.
- [6] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd, "Power7: IBM's Next-Generation Server Processor," IEEE Micro, vol. 30, no. 2, pp. 7--15, 2010.
- [7] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, "Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM," International Symposium on High Performance Computer Architecture (HPCA), 2013.
- [8] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, "A case for exploiting subarray-level parallelism (SALP) in DRAM," International Symposium on Computer Architecture (ISCA), 2012, pp. 368--379.
- [9] (2014). [Online]. Available: http://wccftech.com/intel-xeon-phiknights-landing-processors-stacked-dram-hmc-16gb/
- [10] (2015). [Online]. Available: http://www.amd.com/en-us/innovations/software-technologies/hbm
- [11] B. Pourshirazi and Z. Zhu, "Refree: A Refresh-Free Hybrid DRAM/PCM Main Memory System," International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 566--575.
- [12] N. Gulur, M. Mehendale, R. Manikantan, and R. Govindarajan, "Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth," International Symposium on Microarchitecture (MICRO), 2014, pp. 38--50.
- [13] L. Zhao, R. Iyer, R. Illikkal, and D. Newell, "Exploring DRAM cache architectures for CMP server platforms," International Conference on Computer Design (ICCD), 2007, pp. 55--62.
- [14] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1--7, 2011.
- [15] M. Poremba, T. Zhang, and Y. Xie, "NVMain 2.0: Architectural Simulator to Model (Non-)Volatile Memory Systems," IEEE Computer Architecture Letters (CAL), 2015.
- [16] O. Naji, A. Hansson, C. Weis, M. Jung, and N. Wehn, "A High-Level DRAM Timing, Power and Area Exploration Tool," International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), 2015.
- [17] JEDEC Standard, "DDR4 SDRAM Standard," JESD79-4A, 2013.
- [18] P. K. Tschirhart, "Multi-Level Main Memory Systems: Technology Choices, Design Considerations, and Trade-off Analysis," 2015.
- [19] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," Parallel Architectures and Compilation Techniques (PACT), 2008, pp. 72--81.
- [20] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS Parallel Benchmarks," International Journal of High Performance Computing Applications, vol. 5, no. 3, pp. 63--73, 1991.