
P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings

  • Regular Paper
  • Computer Architecture and Systems
Journal of Computer Science and Technology

Abstract

Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve the DRAM cache hit rate and to lower the cache hit latency. To combine the high hit rate of set-associative mapping with the low hit latency of direct mapping, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. This design is motivated by a key observation: applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and a low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy that balances the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing caused by block type variations. Experimental results demonstrate that P3DC reduces the cache hit latency by 20.5% while achieving a hit rate similar to that of typical set-associative caches. P3DC improves instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache DEC-A8.
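The hybrid mapping described in the abstract (leading blocks at static, direct-mapped positions; following blocks at dynamic positions within the same set-associative structure) can be illustrated with a toy sketch. Everything below is our own simplified assumption for illustration, not the paper's design: the set count, the choice of way 0 as the static position, the `is_leading` flag supplied by the caller, and the trivial fill policy all stand in for P3DC's actual block classification and replacement logic.

```python
NUM_SETS = 4
WAYS = 4  # way 0 serves as the static (direct-mapped) position


class HybridCache:
    """Toy model of a partially direct-mapped, set-associative cache."""

    def __init__(self):
        # each set holds WAYS entries of (tag, is_leading); None = empty
        self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

    def lookup(self, addr, is_leading):
        set_idx = addr % NUM_SETS
        tag = addr // NUM_SETS
        ways = self.sets[set_idx]
        if is_leading:
            # a leading block can only live at the static position, so a
            # single probe decides hit/miss: no associative tag search,
            # hence the lower hit latency
            entry = ways[0]
            return entry is not None and entry[0] == tag
        # a following block may sit in any dynamic way, preserving the
        # higher hit rate of set-associative placement
        return any(e is not None and e[0] == tag for e in ways[1:])

    def insert(self, addr, is_leading):
        set_idx = addr % NUM_SETS
        tag = addr // NUM_SETS
        ways = self.sets[set_idx]
        if is_leading:
            ways[0] = (tag, True)  # static position: replace in place
            return
        # naive fill/evict for the dynamic ways; the paper instead uses a
        # replacement policy balancing miss penalty and temporal locality
        for i in range(1, WAYS):
            if ways[i] is None:
                ways[i] = (tag, False)
                return
        ways[1] = (tag, False)
```

The point of the split is visible in `lookup`: a leading-block hit costs one probe at a fixed location, while following blocks still enjoy multiple candidate positions per set.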


References

  1. Jun H, Cho J, Lee K, Son H Y, Kim K, Jin H, Kim K. HBM (high bandwidth memory) DRAM technology and architecture. In Proc. the 2017 IEEE International Memory Workshop (IMW), May 2017, pp.1–4. DOI: https://doi.org/10.1109/IMW.2017.7939084.

  2. Hadidi R, Asgari B, Mudassar B A, Mukhopadhyay S, Yalamanchili S, Kim H. Demystifying the characteristics of 3D-stacked memories: A case study for hybrid memory cube. In Proc. the 2017 IEEE International Symposium on Workload Characterization (IISWC), Oct. 2017, pp.66–75. DOI: https://doi.org/10.1109/IISWC.2017.8167757.

  3. Shahab A, Zhu M, Margaritov A, Grot B. Farewell my shared LLC! A case for private die-stacked DRAM caches for servers. In Proc. the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct. 2018, pp.559–572. DOI: https://doi.org/10.1109/MICRO.2018.00052.

  4. Volos S, Jevdjic D, Falsafi B, Grot B. Fat caches for scale-out servers. IEEE Micro, 2017, 37(2): 90–103. DOI: https://doi.org/10.1109/MM.2017.32.

  5. Nassif N, Munch A O, Molnar C L, Pasdast G, Lyer S V, Yang Z, Mendoza O, Huddart M, Venkataraman S, Kandula S, Marom R, Kern A M, Bowhill B, Mulvihill D R, Nimmagadda S, Kalidindi V, Krause J, Haq M M, Sharma R, Duda K. Sapphire rapids: The next-generation Intel Xeon scalable processor. In Proc. the 17th IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2022, pp.44–46. DOI: https://doi.org/10.1109/ISSCC42614.2022.9731107.

  6. Zahran M. The future of high-performance computing. In Proc. the 17th International Computer Engineering Conference (ICENCO), Dec. 2021, pp.129–134. DOI: https://doi.org/10.1109/ICENCO49852.2021.9698918.

  7. Loh G H, Hill M D. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proc. the 44th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2011, pp.454–464. DOI: https://doi.org/10.1145/2155620.2155673.

  8. Loh G, Hill M D. Supporting very large DRAM caches with compound-access scheduling and MissMap. IEEE Micro, 2012, 32(3): 70–78. DOI: https://doi.org/10.1109/MM.2012.25.

  9. Qureshi M K, Loh G H. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2012, pp.235–246. DOI: https://doi.org/10.1109/MICRO.2012.30.

  10. Jevdjic D, Volos S, Falsafi B. Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache. ACM SIGARCH Computer Architecture News, 2013, 41(3): 404–415. DOI: https://doi.org/10.1145/2508148.2485957.

  11. Shin D, Jang H, Oh K, Lee J W. An energy-efficient DRAM cache architecture for mobile platforms with PCM-based main memory. ACM Trans. Embedded Computing Systems (TECS), 2022, 21(1): 1–22. DOI: https://doi.org/10.1145/3451995.

  12. Zhang Q, Sui X, Hou R, Zhang L. Line-coalescing DRAM cache. Sustainable Computing: Informatics and Systems, 2021, 29: 100449. DOI: https://doi.org/10.1016/j.suscom.2020.100449.

  13. Zhou F, Wu S, Yue J, Jin H, Shen J. Object fingerprint cache for heterogeneous memory system. IEEE Trans. Computers, 2023, 72(9): 2496–2507. DOI: https://doi.org/10.1109/TC.2023.3251852.

  14. Chi Y, Yue J, Liao X, Liu H, Jin H. A hybrid memory architecture supporting fine-grained data migration. Frontiers of Computer Science, 2024, 18(2): 182103. DOI: https://doi.org/10.1007/s11704-023-2675-y.

  15. Hameed F, Bauer L, Henkel J. Architecting on-chip DRAM cache for simultaneous miss rate and latency reduction. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(4): 651–664. DOI: https://doi.org/10.1109/TCAD.2015.2488488.

  16. Hameed F, Bauer L, Henkel J. Simultaneously optimizing DRAM cache hit latency and miss rate via novel set mapping policies. In Proc. the 16th International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Sept. 29–Oct. 4, 2013. DOI: https://doi.org/10.1109/CASES.2013.6662515.

  17. Behnam P, Bojnordi M N. Adaptively reduced DRAM caching for energy-efficient high bandwidth memory. IEEE Trans. Computers, 2022, 71(10): 2675–2686. DOI: https://doi.org/10.1109/TC.2022.3140897.

  18. Kumar S, Zhao H, Shriraman A, Matthews E, Dwarkadas S, Shannon L. Amoeba-cache: Adaptive blocks for eliminating waste in the memory hierarchy. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.376–388. DOI: https://doi.org/10.1109/MICRO.2012.42.

  19. Huang C C, Nagarajan V. ATCache: Reducing DRAM cache latency via a small SRAM tag cache. In Proc. the 23rd International Conference on Parallel Architectures and Compilation (PACT), Aug. 2014, pp.51–60. DOI: https://doi.org/10.1145/2628071.2628089.

  20. Hameed F, Bauer L, Henkel J. Reducing latency in an SRAM/DRAM cache hierarchy via a novel tag-cache architecture. In Proc. the 51st Annual Design Automation Conference (DAC), Jun. 2014. DOI: https://doi.org/10.1145/2593069.2593197.

  21. Chou C, Jaleel A, Qureshi M K. BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches. ACM SIGARCH Computer Architecture News, 2015, 43(3S): 198–210. DOI: https://doi.org/10.1145/2872887.2750387.

  22. Hameed F, Khan A A, Castrillon J. Improving the performance of block-based DRAM caches via tag-data decoupling. IEEE Trans. Computers, 2021, 70(11): 1914–1927. DOI: https://doi.org/10.1109/TC.2020.3029615.

  23. Kawano M, Wang X Y, Ren Q, Loh W L, Rao B C, Chui K J. One-step TSV process development for 4-layer wafer stacked DRAM. In Proc. the 71st IEEE Electronic Components and Technology Conference (ECTC), Jun. 1–Jul. 4, 2021, pp.673–679. DOI: https://doi.org/10.1109/ECTC32696.2021.00117.

  24. Jiang X, Zuo F, Wang S, Zhou X, Wang Y, Liu Q, Ren Q, Liu M. A 1596-GB/s 48-Gb stacked embedded DRAM 384-core SoC with hybrid bonding integration. IEEE Solid-State Circuits Letters, 2022, 5: 110–113. DOI: https://doi.org/10.1109/LSSC.2022.3171862.

  25. Bose B, Thakkar I. Characterization and mitigation of electromigration effects in TSV-based power delivery network enabled 3D-stacked DRAMs. In Proc. the 31st Great Lakes Symposium on VLSI, Jun. 2021, pp.101–107. DOI: https://doi.org/10.1145/3453688.3461503.

  26. Agarwalla B, Das S, Sahu N. Process variation aware DRAM-Cache resizing. Journal of Systems Architecture, 2022, 123: 102364. DOI: https://doi.org/10.1016/j.sysarc.2021.102364.

  27. Cheng W, Cai R, Zeng L, Feng D, Brinkmann A, Wang Y. IMCI: An efficient fingerprint retrieval approach based on 3D stacked memory. Science China Information Sciences, 2020, 63: 179101. DOI: https://doi.org/10.1007/s11432-019-2672-5.

  28. Gulur N, Mehendale M, Manikantan R, Govindarajan R. Bi-modal DRAM cache: Improving hit rate, hit latency and bandwidth. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.38–50. DOI: https://doi.org/10.1109/MICRO.2014.36.

  29. Jiang S, Chen F, Zhang X. CLOCK-Pro: An effective improvement of the CLOCK replacement. In Proc. the 2005 USENIX Annual Technical Conference, Apr. 2005.

  30. Janapsatya A, Ignjatović A, Peddersen J, Parameswaran S. Dueling CLOCK: Adaptive cache replacement policy based on the CLOCK algorithm. In Proc. the 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), Mar. 2010, pp.920–925. DOI: https://doi.org/10.1109/DATE.2010.5456920.

  31. Bansal S, Modha D S. CAR: Clock with adaptive replacement. In Proc. the 3rd USENIX Conference on File and Storage Technologies (FAST), Mar. 2004, pp.187–200.

  32. Li C. CLOCK-pro+: Improving CLOCK-pro cache replacement with utility-driven adaptation. In Proc. the 12th ACM International Conference on Systems and Storage (SYSTOR), May 2019, pp.1–7. DOI: https://doi.org/10.1145/3319647.3325838.

  33. Binkert N, Beckmann B, Black G, Reinhardt S K, Saidi A, Basu A, Hestness J, Hower D R, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill M D, Wood D A. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1–7. DOI: https://doi.org/10.1145/2024716.2024718.

  34. Poremba M, Xie Y. NVMain: An architectural-level main memory simulator for emerging non-volatile memories. In Proc. the 2012 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Aug. 2012, pp.392–397. DOI: https://doi.org/10.1109/ISVLSI.2012.82.

  35. Jevdjic D, Loh G H, Kaynak C, Falsafi B. Unison cache: A scalable and effective die-stacked DRAM cache. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.25–37. DOI: https://doi.org/10.1109/MICRO.2014.51.

  36. Chou C C, Jaleel A, Qureshi M K. CAMEO: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache. In Proc. the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp.1–12. DOI: https://doi.org/10.1109/MICRO.2014.63.

  37. Sim J, Loh G H, Kim H, O'Connor M, Thottethodi M. A mostly-clean DRAM cache for effective hit speculation and self-balancing dispatch. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.247–257. DOI: https://doi.org/10.1109/MICRO.2012.31.

  38. Young V, Chishti Z A, Qureshi M K. TicToc: Enabling bandwidth-efficient DRAM caching for both hits and misses in hybrid memory systems. In Proc. the 37th IEEE International Conference on Computer Design (ICCD), Nov. 2019, pp.341–349. DOI: https://doi.org/10.1109/ICCD46524.2019.00055.

  39. Zhang M, Kim J G, Yoon S K, Kim S D. Dynamic recognition prefetch engine for DRAM-PCM hybrid main memory. The Journal of Supercomputing, 2022, 78(2): 1885–1902. DOI: https://doi.org/10.1007/s11227-021-03948-5.

  40. Choi S G, Kim J G, Kim S D. Adaptive granularity based last-level cache prefetching method with eDRAM prefetch buffer for graph processing applications. Applied Sciences, 2021, 11(3): 991. DOI: https://doi.org/10.3390/app11030991.

  41. Kilic O O, Tallent N R, Friese R D. Rapid memory footprint access diagnostics. In Proc. the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Aug. 2020, pp.273–284. DOI: https://doi.org/10.1109/ISPASS48437.2020.00047.

  42. Oh Y S, Chung E Y. Energy-efficient shared cache using way prediction based on way access dominance detection. IEEE Access, 2021, 9: 155048–155057. DOI: https://doi.org/10.1109/ACCESS.2021.3126739.

  43. Jang H, Lee Y, Kim J, Kim Y, Kim J, Jeong J, Lee J W. Efficient footprint caching for tagless DRAM caches. In Proc. the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Mar. 2016, pp.237–248. DOI: https://doi.org/10.1109/HPCA.2016.7446068.

  44. Tsukada S, Takayashiki H, Sato M, Komatsu K, Kobayashi H. A metadata prefetching mechanism for hybrid memory architectures. IEICE Trans. Electronics, 2022, E105.C(6): 232–243. DOI: https://doi.org/10.1587/transele.2021LHP0004.

  45. Young V, Chou C, Jaleel A, Qureshi M. ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction. In Proc. the 45th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), Jun. 2018, pp.328–339. DOI: https://doi.org/10.1109/ISCA.2018.00036.

  46. Chen P, Yue J, Liao X, Jin H. Trade-off between hit rate and hit latency for optimizing DRAM cache. IEEE Trans. Emerging Topics in Computing, 2021, 9(1): 55–64. DOI: https://doi.org/10.1109/TETC.2018.2800721.

  47. Vasilakis E, Papaefstathiou V, Trancoso P, Sourdis I. Decoupled fused cache: Fusing a decoupled LLC with a DRAM cache. ACM Trans. Architecture and Code Optimization (TACO), 2018, 15(4): 65. DOI: https://doi.org/10.1145/3293447.

Author information

Corresponding author

Correspondence to Xiao-Fei Liao (廖小飞).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work was supported jointly by the National Key Research and Development Program of China under Grant No. 2022YFB4500303, and the National Natural Science Foundation of China under Grant Nos. 62072198, 61825202, and 61929103.

Ye Chi received his Ph.D. degree in computer science and technology from Huazhong University of Science and Technology (HUST), Wuhan, in 2023. He is now working at the School of Big Data and Internet, Shenzhen Technology University (SZTU), Shenzhen. His research interests are in the areas of computer architecture, die-stacked DRAM, in-memory computing, hybrid memory system architecture, and memory pooling.

Ren-Tong Guo received his B.E. degree in software engineering from Xi’an University of Science and Technology, Xi’an, in 2011, and his Ph.D. degree in computer science and engineering from Huazhong University of Science and Technology (HUST), Wuhan, in 2017. His research interests are in the areas of caching systems and distributed systems.

Xiao-Fei Liao is a professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), Wuhan. He received his Ph.D. degree in computer science and engineering from HUST, Wuhan, in 2005. His research interests are in the areas of system software, P2P systems, cluster computing, and streaming services.

Hai-Kun Liu received his Ph.D. degree in computer science and technology from Huazhong University of Science and Technology (HUST), Wuhan, in 2012. He is a professor at the School of Computer Science and Technology, HUST, Wuhan. His current research interests include in-memory computing, virtualization technologies, cloud computing, and distributed systems.

Jianhui Yue received his Ph.D. degree from the University of Maine, Orono, in 2012. He is an assistant professor in the Computer Science Department at Michigan Technological University, Michigan. His research interests include computer architecture and systems.

Electronic Supplementary Material

About this article

Cite this article

Chi, Y., Guo, RT., Liao, XF. et al. P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings. J. Comput. Sci. Technol. 39, 1341–1360 (2024). https://doi.org/10.1007/s11390-023-2561-y
