Abstract
Recent advances in 3D fabrication have enabled the development of 3D memory stacked over a logic die. 3D memory is a viable solution to the memory wall problem. It consists of stacked DRAM layers connected by Through-Silicon Vias (TSVs). In the future, data-intensive applications on memory-centric network architectures will rely on packet-based communication to ensure scalability and reliable data transfer. This paper studies the performance of 3D memory that uses a packet-based protocol for communication between the CPU and off-chip memory. Our study provides insight into the internal flit traffic of different 3D memory configurations under diverse memory access patterns and workload characteristics. We use CasHMC to capture the performance effects of the packet-based communication protocol and have integrated it with the gem5 simulator to study workloads from the Rodinia benchmark suite. Our evaluation focuses on the following metrics: total memory bandwidth utilization, off-chip link bandwidth utilization, latency, and power consumption. We examine the performance characteristics of 3D stacked memory as the number of banks and vaults in the structure varies. We also study the effect of varying the packet size and the number of communication links on off-chip link bandwidth and latency, and we examine different off-chip link power optimization strategies. Finally, we observe the impact of varying buffer sizes on latency at the off-chip link buffer and at the vault buffer of the 3D memory. Our study offers perspective on further developments in data-centric computing architectures and insight into suitable flit management strategies for future memory architectures.

Cite this article
Pandey, S., Venkatesh, T.G. Performance investigation of packet-based communication in 3D-memories. J Supercomput 78, 19070–19096 (2022). https://doi.org/10.1007/s11227-022-04605-1
DOI: https://doi.org/10.1007/s11227-022-04605-1