Abstract
Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve the DRAM cache hit rate and to lower the cache hit latency. To obtain the high hit rate of set-associative mapping and the low hit latency of direct mapping at the same time, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. The design is motivated by a key observation: applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and a low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy that balances the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing caused by block type variations. Experimental results demonstrate that P3DC reduces the cache hit latency by 20.5% while achieving a hit rate similar to that of typical set-associative caches. P3DC improves the instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache DEC-A8.
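To make the hybrid mapping concrete, the following Python sketch illustrates one way the lookup summarized above could work: a leading block is probed at a single static position, as in a direct-mapped cache, while a following block is searched across the remaining dynamic ways of the same set. This is a minimal sketch based only on the abstract; the set geometry (NUM_SETS, NUM_WAYS), the reservation of way 0 as the static position, and the is_leading flag are illustrative assumptions rather than the authors' actual design.

# A minimal sketch of the hybrid mapping idea, not the authors' implementation.
# NUM_SETS, NUM_WAYS, the choice of way 0 as the static position, and the
# is_leading flag are illustrative assumptions.

NUM_SETS = 4096   # assumed DRAM cache geometry
NUM_WAYS = 8      # assumed associativity: way 0 static, ways 1..7 dynamic

def set_index(block_addr):
    # Both block types map to the same set.
    return block_addr % NUM_SETS

def tag_of(block_addr):
    return block_addr // NUM_SETS

def lookup(cache, block_addr, is_leading):
    # cache: list of sets; each set is a list of (valid, tag) pairs, one per way.
    # Returns (hit, way).
    cache_set = cache[set_index(block_addr)]
    tag = tag_of(block_addr)

    if is_leading:
        # Leading block: probe only the static position (way 0), so a hit
        # costs a single tag check, as in a direct-mapped cache.
        valid, stored_tag = cache_set[0]
        return (valid and stored_tag == tag, 0)

    # Following block: search the dynamic positions (ways 1..NUM_WAYS-1),
    # paying the associative lookup cost in exchange for a higher hit rate.
    for way in range(1, NUM_WAYS):
        valid, stored_tag = cache_set[way]
        if valid and stored_tag == tag:
            return (True, way)
    return (False, None)

# Example: an empty cache; a leading-block lookup misses after one tag probe.
cache = [[(False, 0)] * NUM_WAYS for _ in range(NUM_SETS)]
assert lookup(cache, 0x1234, is_leading=True) == (False, 0)

In this sketch, a leading-block hit requires only one tag check, which is where the latency saving comes from, while following blocks retain the hit-rate benefit of associativity.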
References
Jun H, Cho J, Lee K, Son H Y, Kim K, Jin H, Kim K. HBM (high bandwidth memory) DRAM technology and architecture. In Proc. the 2017 IEEE International Memory Workshop (IMW), May 2017, pp.1–4. DOI: https://doi.org/10.1109/IMW.2017.7939084.
Hadidi R, Asgari B, Mudassar B A, Mukhopadhyay S, Yalamanchili S, Kim H. Demystifying the characteristics of 3D-stacked memories: A case study for hybrid memory cube. In Proc. the 2017 IEEE International Symposium on Workload Characterization (IISWC), Oct. 2017, pp.66–75. DOI: https://doi.org/10.1109/IISWC.2017.8167757.
Shahab A, Zhu M, Margaritov A, Grot B. Farewell my shared LLC! A case for private die-stacked DRAM caches for servers. In Proc. the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct. 2018, pp.559–572. DOI: https://doi.org/10.1109/MICRO.2018.00052.
Volos S, Jevdjic D, Falsafi B, Grot B. Fat caches for scale-out servers. IEEE Micro, 2017, 37(2): 90–103. DOI: https://doi.org/10.1109/MM.2017.32.
Nassif N, Munch A O, Molnar C L, Pasdast G, Lyer S V, Yang Z, Mendoza O, Huddart M, Venkataraman S, Kandula S, Marom R, Kern A M, Bowhill B, Mulvihill D R, Nimmagadda S, Kalidindi V, Krause J, Haq M M, Sharma R, Duda K. Sapphire Rapids: The next-generation Intel Xeon scalable processor. In Proc. the 2022 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2022, pp.44–46. DOI: https://doi.org/10.1109/ISSCC42614.2022.9731107.
Zahran M. The future of high-performance computing. In Proc. the 17th International Computer Engineering Conference (ICENCO), Dec. 2021, pp.129–134. DOI: https://doi.org/10.1109/ICENCO49852.2021.9698918.
Loh G H, Hill M D. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proc. the 44th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2011, pp.454–464. DOI: https://doi.org/10.1145/2155620.2155673.
Loh G, Hill M D. Supporting very large DRAM caches with compound-access scheduling and MissMap. IEEE Micro, 2012, 32(3): 70–78. DOI: https://doi.org/10.1109/MM.2012.25.
Qureshi M K, Loh G H. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2012, pp.235–246. DOI: https://doi.org/10.1109/MICRO.2012.30.
Jevdjic D, Volos S, Falsafi B. Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache. ACM SIGARCH Computer Architecture News, 2013, 41(3): 404–415. DOI: https://doi.org/10.1145/2508148.2485957.
Shin D, Jang H, Oh K, Lee J W. An energy-efficient DRAM cache architecture for mobile platforms with PCM-based main memory. ACM Trans. Embedded Computing Systems (TECS), 2022, 21(1): 1–22. DOI: https://doi.org/10.1145/3451995.
Zhang Q, Sui X, Hou R, Zhang L. Line-coalescing DRAM cache. Sustainable Computing: Informatics and Systems, 2021, 29: 100449. DOI: https://doi.org/10.1016/j.suscom.2020.100449.
Zhou F, Wu S, Yue J, Jin H, Shen J. Object fingerprint cache for heterogeneous memory system. IEEE Trans. Computers, 2023, 72(9): 2496–2507. DOI: https://doi.org/10.1109/TC.2023.3251852.
Chi Y, Yue J, Liao X, Liu H, Jin H. A hybrid memory architecture supporting fine-grained data migration. Frontiers of Computer Science, 2024, 18(2): 182103. DOI: https://doi.org/10.1007/s11704-023-2675-y.
Hameed F, Bauer L, Henkel J. Architecting on-chip DRAM cache for simultaneous miss rate and latency reduction. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(4): 651–664. DOI: https://doi.org/10.1109/TCAD.2015.2488488.
Hameed F, Bauer L, Henkel J. Simultaneously optimizing DRAM cache hit latency and miss rate via novel set mapping policies. In Proc. the 16th International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Sept. 29–Oct. 4, 2013. DOI: https://doi.org/10.1109/CASES.2013.6662515.
Behnam P, Bojnordi M N. Adaptively reduced DRAM caching for energy-efficient high bandwidth memory. IEEE Trans. Computers, 2022, 71(10): 2675–2686. DOI: https://doi.org/10.1109/TC.2022.3140897.
Kumar S, Zhao H, Shriraman A, Matthews E, Dwarkadas S, Shannon L. Amoeba-cache: Adaptive blocks for eliminating waste in the memory hierarchy. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.376–388. DOI: https://doi.org/10.1109/MICRO.2012.42.
Huang C C, Nagarajan V. ATCache: Reducing DRAM cache latency via a small SRAM tag cache. In Proc. the 23rd International Conference on Parallel Architectures and Compilation (PACT), Aug. 2014, pp.51–60. DOI: https://doi.org/10.1145/2628071.2628089.
Hameed F, Bauer L, Henkel J. Reducing latency in an SRAM/DRAM cache hierarchy via a novel tag-cache architecture. In Proc. the 51st Annual Design Automation Conference (DAC), Jun. 2014. DOI: https://doi.org/10.1145/2593069.2593197.
Chou C, Jaleel A, Qureshi M K. BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches. ACM SIGARCH Computer Architecture News, 2015, 43(3S): 198–210. DOI: https://doi.org/10.1145/2872887.2750387.
Hameed F, Khan A A, Castrillon J. Improving the performance of block-based DRAM caches via tag-data decoupling. IEEE Trans. Computers, 2021, 70(11): 1914–1927. DOI: https://doi.org/10.1109/TC.2020.3029615.
Kawano M, Wang X Y, Ren Q, Loh W L, Rao B C, Chui K J. One-step TSV process development for 4-layer wafer stacked DRAM. In Proc. the 71st IEEE Electronic Components and Technology Conference (ECTC), Jun. 1–Jul. 4, 2021, pp.673–679. DOI: https://doi.org/10.1109/ECTC32696.2021.00117.
Jiang X, Zuo F, Wang S, Zhou X, Wang Y, Liu Q, Ren Q, Liu M. A 1596-GB/s 48-Gb stacked embedded DRAM 384-core SoC with hybrid bonding integration. IEEE Solid-State Circuits Letters, 2022, 5: 110–113. DOI: https://doi.org/10.1109/LSSC.2022.3171862.
Bose B, Thakkar I. Characterization and mitigation of electromigration effects in TSV-based power delivery network enabled 3D-stacked DRAMs. In Proc. the 31st Great Lakes Symposium on VLSI, Jun. 2021, pp.101–107. DOI: https://doi.org/10.1145/3453688.3461503.
Agarwalla B, Das S, Sahu N. Process variation aware DRAM-Cache resizing. Journal of Systems Architecture, 2022, 123: 102364. DOI: https://doi.org/10.1016/j.sysarc.2021.102364.
Cheng W, Cai R, Zeng L, Feng D, Brinkmann A, Wang Y. IMCI: An efficient fingerprint retrieval approach based on 3D stacked memory. Science China Information Sciences, 2020, 63: 179101. DOI: https://doi.org/10.1007/s11432-019-2672-5.
Gulur N, Mehendale M, Manikantan R, Govindarajan R. Bi-modal DRAM cache: Improving hit rate, hit latency and bandwidth. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.38–50. DOI: https://doi.org/10.1109/MICRO.2014.36.
Jiang S, Chen F, Zhang X. CLOCK-Pro: An effective improvement of the CLOCK replacement. In Proc. the 2005 Annual Conference on USENIX Annual Technical Conference, Apr. 2005.
Janapsatya A, Ignjatović A, Peddersen J, Parameswaran S. Dueling CLOCK: Adaptive cache replacement policy based on the CLOCK algorithm. In Proc. the 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), Mar. 2010, pp.920–925. DOI: https://doi.org/10.1109/DATE.2010.5456920.
Bansal S, Modha D S. CAR: Clock with adaptive replacement. In Proc. the 3rd USENIX Conference on File and Storage Technologies (FAST), Mar. 2004, pp.187–200.
Li C. CLOCK-Pro+: Improving CLOCK-Pro cache replacement with utility-driven adaptation. In Proc. the 12th ACM International Conference on Systems and Storage (SYSTOR), May 2019, pp.1–7. DOI: https://doi.org/10.1145/3319647.3325838.
Binkert N, Beckmann B, Black G, Reinhardt S K, Saidi A, Basu A, Hestness J, Hower D R, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill M D, Wood D A. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1–7. DOI: https://doi.org/10.1145/2024716.2024718.
Poremba M, Xie Y. NVMain: An architectural-level main memory simulator for emerging non-volatile memories. In Proc. the 2012 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Aug. 2012, pp.392–397. DOI: https://doi.org/10.1109/ISVLSI.2012.82.
Jevdjic D, Loh G H, Kaynak C, Falsafi B. Unison cache: A scalable and effective die-stacked DRAM cache. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.25–37. DOI: https://doi.org/10.1109/MICRO.2014.51.
Chou C C, Jaleel A, Qureshi M K. CAMEO: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.1–12. DOI: https://doi.org/10.1109/MICRO.2014.63.
Sim J, Loh G H, Kim H, O'Connor M, Thottethodi M. A mostly-clean DRAM cache for effective hit speculation and self-balancing dispatch. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.247–257. DOI: https://doi.org/10.1109/MICRO.2012.31.
Young V, Chishti Z A, Qureshi M K. TicToc: Enabling bandwidth-efficient DRAM caching for both hits and misses in hybrid memory systems. In Proc. the 37th IEEE International Conference on Computer Design (ICCD), Nov. 2019, pp.341–349. DOI: https://doi.org/10.1109/ICCD46524.2019.00055.
Zhang M, Kim J G, Yoon S K, Kim S D. Dynamic recognition prefetch engine for DRAM-PCM hybrid main memory. The Journal of Supercomputing, 2022, 78(2): 1885–1902. DOI: https://doi.org/10.1007/s11227-021-03948-5.
Choi S G, Kim J G, Kim S D. Adaptive granularity based last-level cache prefetching method with eDRAM prefetch buffer for graph processing applications. Applied Sciences, 2021, 11(3): 991. DOI: https://doi.org/10.3390/app11030991.
Kilic O O, Tallent N R, Friese R D. Rapid memory footprint access diagnostics. In Proc. the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Aug. 2020, pp.273–284. DOI: https://doi.org/10.1109/ISPASS48437.2020.00047.
Oh Y S, Chung E Y. Energy-efficient shared cache using way prediction based on way access dominance detection. IEEE Access, 2021, 9: 155048–155057. DOI: https://doi.org/10.1109/ACCESS.2021.3126739.
Jang H, Lee Y, Kim J, Kim Y, Kim J, Jeong J, Lee J W. Efficient footprint caching for Tagless DRAM Caches. In Proc. the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Mar. 2016, pp.237–248. DOI: https://doi.org/10.1109/HPCA.2016.7446068.
Tsukada S, Takayashiki H, Sato M, Komatsu K, Kobayashi H. A metadata prefetching mechanism for hybrid memory architectures. IEICE Trans. Electronics, 2022, E105.C(6): 232–243. DOI: https://doi.org/10.1587/transele.2021LHP0004.
Young V, Chou C, Jaleel A, Qureshi M. ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction. In Proc. the 45th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Jun. 2018, pp.328–339. DOI: https://doi.org/10.1109/ISCA.2018.00036.
Chen P, Yue J, Liao X, Jin H. Trade-off between hit rate and hit latency for optimizing DRAM cache. IEEE Trans. Emerging Topics in Computing, 2021, 9(1): 55–64. DOI: https://doi.org/10.1109/TETC.2018.2800721.
Vasilakis E, Papaefstathiou V, Trancoso P, Sourdis I. Decoupled fused cache: Fusing a decoupled LLC with a DRAM cache. ACM Trans. Architecture and Code Optimization (TACO), 2018, 15(4): 65. DOI: https://doi.org/10.1145/3293447.
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was jointly supported by the National Key Research and Development Program of China under Grant No. 2022YFB4500303 and the National Natural Science Foundation of China under Grant Nos. 62072198, 61825202, and 61929103.
Ye Chi received his Ph.D. degree in computer science and technology from Huazhong University of Science and Technology (HUST), Wuhan, in 2023. He is now with the School of Big Data and Internet, Shenzhen Technology University (SZTU), Shenzhen. His research interests are in the areas of computer architecture, die-stacked DRAM, in-memory computing, hybrid memory system architecture, and memory pooling.
Ren-Tong Guo received his B.E. degree in software engineering from Xi’an University of Science and Technology, Xi’an, in 2011, and his Ph.D. degree in computer science and engineering from Huazhong University of Science and Technology (HUST), Wuhan, in 2017. His research interests are in the areas of caching systems and distributed systems.
Xiao-Fei Liao is a professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), Wuhan. He received his Ph.D. degree in computer science and engineering from HUST, Wuhan, in 2005. His research interests are in the areas of system software, P2P systems, cluster computing, and streaming services.
Hai-Kun Liu received his Ph.D. degree in computer science and technology from Huazhong University of Science and Technology (HUST), Wuhan, in 2012. He is a professor at the School of Computer Science and Technology, HUST, Wuhan. His current research interests include in-memory computing, virtualization technologies, cloud computing, and distributed systems.
Jianhui Yue received his Ph.D. degree from the University of Maine, Orono, in 2012. He is an assistant professor in the Computer Science Department at Michigan Technological University, Michigan. His research interests include computer architecture and systems.
About this article
Cite this article
Chi, Y., Guo, RT., Liao, XF. et al. P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings. J. Comput. Sci. Technol. 39, 1341–1360 (2024). https://doi.org/10.1007/s11390-023-2561-y