Abstract
Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve the DRAM cache hit rate and to lower the cache hit latency. To obtain the high hit rate of set-associative mapping and the low hit latency of direct mapping at the same time, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. The design is motivated by a key observation: applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and a low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy that balances the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing caused by block type variations. Experimental results demonstrate that P3DC reduces the cache hit latency by 20.5% while achieving a hit rate similar to that of typical set-associative caches. P3DC improves the instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache DEC-A8.
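To make the hybrid mapping concrete, the following Python sketch illustrates one way the lookup summarized above could work: a leading block is probed at a single static position, as in a direct-mapped cache, while a following block is searched across the remaining dynamic ways of the same set. This is a minimal sketch based only on the abstract; the set geometry (NUM_SETS, NUM_WAYS), the reservation of way 0 as the static position, and the is_leading flag are illustrative assumptions rather than the authors' actual design.

# A minimal sketch of the hybrid mapping idea, not the authors' implementation.
# NUM_SETS, NUM_WAYS, the choice of way 0 as the static position, and the
# is_leading flag are illustrative assumptions.

NUM_SETS = 4096   # assumed DRAM cache geometry
NUM_WAYS = 8      # assumed associativity: way 0 static, ways 1..7 dynamic

def set_index(block_addr):
    # Both block types map to the same set.
    return block_addr % NUM_SETS

def tag_of(block_addr):
    return block_addr // NUM_SETS

def lookup(cache, block_addr, is_leading):
    # cache: list of sets; each set is a list of (valid, tag) pairs, one per way.
    # Returns (hit, way).
    cache_set = cache[set_index(block_addr)]
    tag = tag_of(block_addr)

    if is_leading:
        # Leading block: probe only the static position (way 0), so a hit
        # costs a single tag check, as in a direct-mapped cache.
        valid, stored_tag = cache_set[0]
        return (valid and stored_tag == tag, 0)

    # Following block: search the dynamic positions (ways 1..NUM_WAYS-1),
    # paying the associative lookup cost in exchange for a higher hit rate.
    for way in range(1, NUM_WAYS):
        valid, stored_tag = cache_set[way]
        if valid and stored_tag == tag:
            return (True, way)
    return (False, None)

# Example: an empty cache; a leading-block lookup misses after one tag probe.
cache = [[(False, 0)] * NUM_WAYS for _ in range(NUM_SETS)]
assert lookup(cache, 0x1234, is_leading=True) == (False, 0)

In this sketch, a leading-block hit requires only one tag check, which is where the latency saving comes from, while following blocks retain the hit-rate benefit of associativity.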
References
Jun H, Cho J, Lee K, Son H Y, Kim K, Jin H, Kim K. HBM (high bandwidth memory) DRAM technology and architecture. In Proc. the 2017 IEEE International Memory Workshop (IMW), May 2017, pp.1–4. DOI: https://doi.org/10.1109/IMW.2017.7939084.
Hadidi R, Asgari B, Mudassar B A, Mukhopadhyay S, Yalamanchili S, Kim H. Demystifying the characteristics of 3D-stacked memories: A case study for hybrid memory cube. In Proc. the 2017 IEEE International Symposium on Workload Characterization (IISWC), Oct. 2017, pp.66–75. DOI: https://doi.org/10.1109/IISWC.2017.8167757.
Shahab A, Zhu M, Margaritov A, Grot B. Farewell my shared LLC! A case for private die-stacked DRAM caches for servers. In Proc. the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct. 2018, pp.559–572. DOI: https://doi.org/10.1109/MICRO.2018.00052.
Volos S, Jevdjic D, Falsafi B, Grot B. Fat caches for scale-out servers. IEEE Micro, 2017, 37(2): 90–103. DOI: https://doi.org/10.1109/MM.2017.32.
Nassif N, Munch A O, Molnar C L, Pasdast G, Lyer S V, Yang Z, Mendoza O, Huddart M, Venkataraman S, Kandula S, Marom R, Kern A M, Bowhill B, Mulvihill D R, Nimmagadda S, Kalidindi V, Krause J, Haq M M, Sharma R, Duda K. Sapphire Rapids: The next-generation Intel Xeon scalable processor. In Proc. the 2022 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2022, pp.44–46. DOI: https://doi.org/10.1109/ISSCC42614.2022.9731107.
Zahran M. The future of high-performance computing. In Proc. the 17th International Computer Engineering Conference (ICENCO), Dec. 2021, pp.129–134. DOI: https://doi.org/10.1109/ICENCO49852.2021.9698918.
Loh G H, Hill M D. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proc. the 44th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2011, pp.454–464. DOI: https://doi.org/10.1145/2155620.2155673.
Loh G, Hill M D. Supporting very large DRAM caches with compound-access scheduling and MissMap. IEEE Micro, 2012, 32(3): 70–78. DOI: https://doi.org/10.1109/MM.2012.25.
Qureshi M K, Loh G H. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2012, pp.235–246. DOI: https://doi.org/10.1109/MICRO.2012.30.
Jevdjic D, Volos S, Falsafi B. Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache. ACM SIGARCH Computer Architecture News, 2013, 41(3): 404–415. DOI: https://doi.org/10.1145/2508148.2485957.
Shin D, Jang H, Oh K, Lee J W. An energy-efficient DRAM cache architecture for mobile platforms with PCM-based main memory. ACM Trans. Embedded Computing Systems (TECS), 2022, 21(1): 1–22. DOI: https://doi.org/10.1145/3451995.
Zhang Q, Sui X, Hou R, Zhang L. Line-coalescing DRAM cache. Sustainable Computing: Informatics and Systems, 2021, 29: 100449. DOI: https://doi.org/10.1016/j.suscom.2020.100449.
Zhou F, Wu S, Yue J, Jin H, Shen J. Object fingerprint cache for heterogeneous memory system. IEEE Trans. Computers, 2023, 72(9): 2496–2507. DOI: https://doi.org/10.1109/TC.2023.3251852.
Chi Y, Yue J, Liao X, Liu H, Jin H. A hybrid memory architecture supporting fine-grained data migration. Frontiers of Computer Science, 2024, 18(2): 182103. DOI: https://doi.org/10.1007/s11704-023-2675-y.
Hameed F, Bauer L, Henkel J. Architecting on-chip DRAM cache for simultaneous miss rate and latency reduction. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(4): 651–664. DOI: https://doi.org/10.1109/TCAD.2015.2488488.
Hameed F, Bauer L, Henkel J. Simultaneously optimizing DRAM cache hit latency and miss rate via novel set mapping policies. In Proc. the 16th International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Sept. 29–Oct. 4, 2013. DOI: https://doi.org/10.1109/CASES.2013.6662515.
Behnam P, Bojnordi M N. Adaptively reduced DRAM caching for energy-efficient high bandwidth memory. IEEE Trans. Computers, 2022, 71(10): 2675–2686. DOI: https://doi.org/10.1109/TC.2022.3140897.
Kumar S, Zhao H, Shriraman A, Matthews E, Dwarkadas S, Shannon L. Amoeba-cache: Adaptive blocks for eliminating waste in the memory hierarchy. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.376–388. DOI: https://doi.org/10.1109/MICRO.2012.42.
Huang C C, Nagarajan V. ATCache: Reducing DRAM cache latency via a small SRAM tag cache. In Proc. the 23rd International Conference on Parallel Architectures and Compilation (PACT), Aug. 2014, pp.51–60. DOI: https://doi.org/10.1145/2628071.2628089.
Hameed F, Bauer L, Henkel J. Reducing latency in an SRAM/DRAM cache hierarchy via a novel tag-cache architecture. In Proc. the 51st Annual Design Automation Conference (DAC), Jun. 2014. DOI: https://doi.org/10.1145/2593069.2593197.
Chou C, Jaleel A, Qureshi M K. BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches. ACM SIGARCH Computer Architecture News, 2015, 43(3S): 198–210. DOI: https://doi.org/10.1145/2872887.2750387.
Hameed F, Khan A A, Castrillon J. Improving the performance of block-based DRAM caches via tag-data decoupling. IEEE Trans. Computers, 2021, 70(11): 1914–1927. DOI: https://doi.org/10.1109/TC.2020.3029615.
Kawano M, Wang X Y, Ren Q, Loh W L, Rao B C, Chui K J. One-step TSV process development for 4-layer wafer stacked DRAM. In Proc. the 71st IEEE Electronic Components and Technology Conference (ECTC), Jun. 1–Jul. 4, 2021, pp.673–679. DOI: https://doi.org/10.1109/ECTC32696.2021.00117.
Jiang X, Zuo F, Wang S, Zhou X, Wang Y, Liu Q, Ren Q, Liu M. A 1596-GB/s 48-Gb stacked embedded DRAM 384-core SoC with hybrid bonding integration. IEEE Solid-State Circuits Letters, 2022, 5: 110–113. DOI: https://doi.org/10.1109/LSSC.2022.3171862.
Bose B, Thakkar I. Characterization and mitigation of electromigration effects in TSV-based power delivery network enabled 3D-stacked DRAMs. In Proc. the 31st Great Lakes Symposium on VLSI, Jun. 2021, pp.101–107. DOI: https://doi.org/10.1145/3453688.3461503.
Agarwalla B, Das S, Sahu N. Process variation aware DRAM-Cache resizing. Journal of Systems Architecture, 2022, 123: 102364. DOI: https://doi.org/10.1016/j.sysarc.2021.102364.
Cheng W, Cai R, Zeng L, Feng D, Brinkmann A, Wang Y. IMCI: An efficient fingerprint retrieval approach based on 3D stacked memory. Science China Information Sciences, 2020, 63: 179101. DOI: https://doi.org/10.1007/s11432-019-2672-5.
Gulur N, Mehendale M, Manikantan R, Govindarajan R. Bi-modal DRAM cache: Improving hit rate, hit latency and bandwidth. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.38–50. DOI: https://doi.org/10.1109/MICRO.2014.36.
Jiang S, Chen F, Zhang X. CLOCK-Pro: An effective improvement of the CLOCK replacement. In Proc. the 2005 Annual Conference on USENIX Annual Technical Conference, Apr. 2005.
Janapsatya A, Ignjatović A, Peddersen J, Parameswaran S. Dueling CLOCK: Adaptive cache replacement policy based on the CLOCK algorithm. In Proc. the 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), Mar. 2010, pp.920–925. DOI: https://doi.org/10.1109/DATE.2010.5456920.
Bansal S, Modha D S. CAR: Clock with adaptive replacement. In Proc. the 3rd USENIX Conference on File and Storage Technologies (FAST), Mar. 2004, pp.187–200.
Li C. CLOCK-Pro+: Improving CLOCK-Pro cache replacement with utility-driven adaptation. In Proc. the 12th ACM International Conference on Systems and Storage (SYSTOR), May 2019, pp.1–7. DOI: https://doi.org/10.1145/3319647.3325838.
Binkert N, Beckmann B, Black G, Reinhardt S K, Saidi A, Basu A, Hestness J, Hower D R, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill M D, Wood D A. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1–7. DOI: https://doi.org/10.1145/2024716.2024718.
Poremba M, Xie Y. NVMain: An architectural-level main memory simulator for emerging non-volatile memories. In Proc. the 2012 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Aug. 2012, pp.392–397. DOI: https://doi.org/10.1109/ISVLSI.2012.82.
Jevdjic D, Loh G H, Kaynak C, Falsafi B. Unison cache: A scalable and effective die-stacked DRAM cache. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.25–37. DOI: https://doi.org/10.1109/MICRO.2014.51.
Chou C C, Jaleel A, Qureshi M K. CAMEO: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.1–12. DOI: https://doi.org/10.1109/MICRO.2014.63.
Sim J, Loh G H, Kim H, O'Connor M, Thottethodi M. A mostly-clean DRAM cache for effective hit speculation and self-balancing dispatch. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.247–257. DOI: https://doi.org/10.1109/MICRO.2012.31.
Young V, Chishti Z A, Qureshi M K. TicToc: Enabling bandwidth-efficient DRAM caching for both hits and misses in hybrid memory systems. In Proc. the 37th IEEE International Conference on Computer Design (ICCD), Nov. 2019, pp.341–349. DOI: https://doi.org/10.1109/ICCD46524.2019.00055.
Zhang M, Kim J G, Yoon S K, Kim S D. Dynamic recognition prefetch engine for DRAM-PCM hybrid main memory. The Journal of Supercomputing, 2022, 78(2): 1885–1902. DOI: https://doi.org/10.1007/s11227-021-03948-5.
Choi S G, Kim J G, Kim S D. Adaptive granularity based last-level cache prefetching method with eDRAM prefetch buffer for graph processing applications. Applied Sciences, 2021, 11(3): 991. DOI: https://doi.org/10.3390/app11030991.
Kilic O O, Tallent N R, Friese R D. Rapid memory footprint access diagnostics. In Proc. the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Aug. 2020, pp.273–284. DOI: https://doi.org/10.1109/ISPASS48437.2020.00047.
Oh Y S, Chung E Y. Energy-efficient shared cache using way prediction based on way access dominance detection. IEEE Access, 2021, 9: 155048–155057. DOI: https://doi.org/10.1109/ACCESS.2021.3126739.
Jang H, Lee Y, Kim J, Kim Y, Kim J, Jeong J, Lee J W. Efficient footprint caching for Tagless DRAM Caches. In Proc. the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Mar. 2016, pp.237–248. DOI: https://doi.org/10.1109/HPCA.2016.7446068.
Tsukada S, Takayashiki H, Sato M, Komatsu K, Kobayashi H. A metadata prefetching mechanism for hybrid memory architectures. IEICE Trans. Electronics, 2022, E105.C(6): 232–243. DOI: https://doi.org/10.1587/transele.2021LHP0004.
Young V, Chou C, Jaleel A, Qureshi M. ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction. In Proc. the 45th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Jun. 2018, pp.328–339. DOI: https://doi.org/10.1109/ISCA.2018.00036.
Chen P, Yue J, Liao X, Jin H. Trade-off between hit rate and hit latency for optimizing DRAM cache. IEEE Trans. Emerging Topics in Computing, 2021, 9(1): 55–64. DOI: https://doi.org/10.1109/TETC.2018.2800721.
Vasilakis E, Papaefstathiou V, Trancoso P, Sourdis I. Decoupled fused cache: Fusing a decoupled LLC with a DRAM cache. ACM Trans. Architecture and Code Optimization (TACO), 2018, 15(4): 65. DOI: https://doi.org/10.1145/3293447.
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was jointly supported by the National Key Research and Development Program of China under Grant No. 2022YFB4500303 and the National Natural Science Foundation of China under Grant Nos. 62072198, 61825202, and 61929103.
Ye Chi received his Ph.D. degree in computer science and technology from Huazhong University of Science and Technology (HUST), Wuhan, in 2023. He is now with the School of Big Data and Internet, Shenzhen Technology University (SZTU), Shenzhen. His research interests are in the areas of computer architecture, die-stacked DRAM, in-memory computing, hybrid memory system architecture, and memory pooling.
Ren-Tong Guo received his B.E. degree in software engineering from Xi’an University of Science and Technology, Xi’an, in 2011, and his Ph.D. degree in computer science and engineering from Huazhong University of Science and Technology (HUST), Wuhan, in 2017. His research interests are in the areas of caching systems and distributed systems.
Xiao-Fei Liao is a professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), Wuhan. He received his Ph.D. degree in computer science and engineering from HUST, Wuhan, in 2005. His research interests are in the areas of system software, P2P systems, cluster computing, and streaming services.
Hai-Kun Liu received his Ph.D. degree in computer science and technology from Huazhong University of Science and Technology (HUST), Wuhan, in 2012. He is a professor at the School of Computer Science and Technology, HUST, Wuhan. His current research interests include in-memory computing, virtualization technologies, cloud computing, and distributed systems.
Jianhui Yue received his Ph.D. degree from the University of Maine, Orono, in 2012. He is an assistant professor in the Computer Science Department at Michigan Technological University, Michigan. His research interests include computer architecture and systems.
About this article
Cite this article
Chi, Y., Guo, RT., Liao, XF. et al. P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings. J. Comput. Sci. Technol. 39, 1341–1360 (2024). https://doi.org/10.1007/s11390-023-2561-y