DOI: 10.1145/2818950.2818952
research-article

Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore

Published: 05 October 2015

Abstract

A promising recent development that can provide continued scaling of performance is the ability to stack multiple DRAM layers on a multi-core processor die. This paper analyzes the interaction between the interconnection network and the memory hierarchy in such systems, and its impact on system performance. We explore the design considerations of a 3D system with DRAM-on-processor stacking and note that the full advantages of 3D stacking can only be achieved by configuring the memory with a high number of channels. This significantly increases memory-level parallelism, which decreases the traffic per DRAM bank and reduces bank queuing delays, but increases traffic on the interconnection network, making remote accesses expensive. To reduce the latency and traffic on the network, we propose restructuring the memory hierarchy into a memory-side cache organization, and we also explore the effects of various address translations and OS page allocation strategies. Our results indicate that a carefully designed 3D memory system can already improve performance by 25-35% without resorting to new, sophisticated techniques.
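
As a rough illustration of why the address mapping matters once the channel count is high, the sketch below counts how a sequential stream of cache-line requests spreads across channels under two interleaving granularities. It is not the mapping evaluated in the paper: the 64-byte line size, the 16 channels, the 4 KiB coarse granularity, and the names `channel_of` and `channel_histogram` are all assumptions chosen for illustration.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache-line size, not taken from the paper


def channel_of(addr: int, num_channels: int, interleave_bytes: int) -> int:
    """Return the channel index when consecutive blocks of
    interleave_bytes rotate round-robin across the channels."""
    return (addr // interleave_bytes) % num_channels


def channel_histogram(addrs, num_channels: int, interleave_bytes: int):
    """Count how many requests land on each channel."""
    counts = Counter(
        channel_of(a, num_channels, interleave_bytes) for a in addrs
    )
    return [counts.get(c, 0) for c in range(num_channels)]


# A sequential stream of 64 cache lines (4 KiB in total).
stream = [i * LINE_BYTES for i in range(64)]

# Fine-grained (cache-line) interleaving spreads the stream evenly.
fine = channel_histogram(stream, num_channels=16, interleave_bytes=64)

# Coarse (page-sized) interleaving sends the whole stream to one channel.
coarse = channel_histogram(stream, num_channels=16, interleave_bytes=4096)

print("fine  :", fine)
print("coarse:", coarse)
```

Under these assumptions, fine-grained interleaving exposes more parallelism across channels, while coarse interleaving keeps a page on a single channel. That is the kind of trade-off between parallelism and locality that address translation and page allocation policies can tune.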

    Published In

    MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems
    October 2015
    278 pages
    ISBN:9781450336048
    DOI:10.1145/2818950
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3D memory system
    2. Address mapping
    3. HMC
    4. Interconnection network
    5. Near data computing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MEMSYS '15: International Symposium on Memory Systems
    October 5 - 8, 2015
    Washington, DC, USA

    Cited By

    • (2022) DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1141-1155. DOI: 10.1109/HPCA53966.2022.00087. Online publication date: Apr-2022
    • (2022) A Modern Primer on Processing in Memory. Emerging Computing: From Devices to Systems, pages 171-243. DOI: 10.1007/978-981-16-7487-7_7. Online publication date: 9-Jul-2022
    • (2021) QUAC-TRNG. Proceedings of the 48th Annual International Symposium on Computer Architecture, pages 944-957. DOI: 10.1109/ISCA52012.2021.00078. Online publication date: 14-Jun-2021
    • (2021) DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access, 9, pages 134457-134502. DOI: 10.1109/ACCESS.2021.3110993. Online publication date: 2021
    • (2020) MAHASIM: Machine-Learning Hardware Acceleration Using a Software-Defined Intelligent Memory System. Journal of Signal Processing Systems. DOI: 10.1007/s11265-019-01505-1. Online publication date: 28-Feb-2020
    • (2019) Processing-in-memory: A workload-driven perspective. IBM Journal of Research and Development, 63:6, pages 3:1-3:19. DOI: 10.1147/JRD.2019.2934048. Online publication date: 1-Nov-2019
    • (2019) Charon. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 726-739. DOI: 10.1145/3352460.3358297. Online publication date: 12-Oct-2019
    • (2019) Enabling Practical Processing in and near Memory for Data-Intensive Computing. Proceedings of the 56th Annual Design Automation Conference 2019, pages 1-4. DOI: 10.1145/3316781.3323476. Online publication date: 2-Jun-2019
    • (2019) To Stack or Not To Stack. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 110-123. DOI: 10.1109/PACT.2019.00017. Online publication date: 23-Sep-2019
    • (2019) D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 582-595. DOI: 10.1109/HPCA.2019.00011. Online publication date: Feb-2019
