Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore

Authors: Syed Minhaj Hassan, Sudhakar Yalamanchili, Saibal Mukhopadhyay

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems

Pages 11 - 21

https://doi.org/10.1145/2818950.2818952

Published: 05 October 2015

Abstract

A promising recent development for sustaining performance scaling is the ability to stack multiple DRAM layers on a multi-core processor die. This paper analyzes the interaction between the interconnection network and the memory hierarchy in such systems and its impact on system performance. We explore the design considerations of a 3D system with DRAM-on-processor stacking and note that the full advantages of 3D stacking can only be achieved by configuring the memory with a high number of channels. Doing so significantly increases memory-level parallelism, which reduces the traffic per DRAM bank and therefore the queuing delays at the banks, but increases traffic on the interconnection network, making remote accesses expensive. To reduce latency and traffic on the network, we propose restructuring the memory hierarchy into a memory-side cache organization, and we explore the effects of various address translation schemes and OS page allocation strategies. Our results indicate that a carefully designed 3D memory system can improve performance by 25-35% without resorting to new, more sophisticated techniques.
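The tension the abstract describes, between spreading traffic over many channels and keeping accesses close to the requesting core, can be illustrated with a toy address-mapping sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the channel count, the line and page sizes, and the two hypothetical mapping functions `channel_line_interleaved` and `channel_page_interleaved`. It only counts which channel each access lands on; it does not model queuing, network latency, or a memory-side cache.

```python
# Illustrative sketch only: the parameters and mapping functions below are
# assumptions, not values or schemes taken from the paper.
from collections import Counter

NUM_CHANNELS = 16   # assumed number of memory channels (one above each tile)
LINE_BYTES = 64     # assumed cache-line size
PAGE_BYTES = 4096   # assumed OS page size


def channel_line_interleaved(addr: int) -> int:
    """Hypothetical mapping: consecutive cache lines go to consecutive channels."""
    return (addr // LINE_BYTES) % NUM_CHANNELS


def channel_page_interleaved(addr: int) -> int:
    """Hypothetical mapping: consecutive pages go to consecutive channels."""
    return (addr // PAGE_BYTES) % NUM_CHANNELS


def remote_fraction(addresses, core_channel, mapping) -> float:
    """Fraction of accesses served by a channel other than the core's local one."""
    addresses = list(addresses)
    remote = sum(1 for a in addresses if mapping(a) != core_channel)
    return remote / len(addresses)


def per_channel_load(addresses, mapping) -> Counter:
    """Number of accesses that land on each channel."""
    return Counter(mapping(a) for a in addresses)


# A core located under channel 0 streams through one 4 KiB page, line by line.
stream = range(0, PAGE_BYTES, LINE_BYTES)
line_remote = remote_fraction(stream, 0, channel_line_interleaved)
page_remote = remote_fraction(stream, 0, channel_page_interleaved)

print("line-interleaved remote fraction:", line_remote)
print("page-interleaved remote fraction:", page_remote)
print("line-interleaved load:", dict(per_channel_load(stream, channel_line_interleaved)))
print("page-interleaved load:", dict(per_channel_load(stream, channel_page_interleaved)))
```

Under these assumptions, the line-interleaved mapping spreads the 64 accesses evenly over all channels but sends most of them across the network, while the page-interleaved mapping keeps every access local at the cost of concentrating load on one channel. This is the kind of trade-off that address translation and page allocation choices can shift.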

Index Terms

Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems

October 2015

278 pages

ISBN:9781450336048

DOI:10.1145/2818950

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MEMSYS '15: International Symposium on Memory Systems

October 5 - 8, 2015

Washington, DC, USA

Cited By

Bostanci F, Olgun A, Orosa L, Yaglikci A, Kim J, Hassan H, Ergin O, Mutlu O (2022). DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1141-1155. Online publication date: Apr-2022.
https://doi.org/10.1109/HPCA53966.2022.00087
Mutlu O, Ghose S, Gómez-Luna J, Ausavarungnirun R (2022). A Modern Primer on Processing in Memory. Emerging Computing: From Devices to Systems, pages 171-243. Online publication date: 9-Jul-2022.
https://doi.org/10.1007/978-981-16-7487-7_7
Olgun A, Patel M, Yağlikçi A, Luo H, Kim J, Bostanci F, Vijaykumar N, Ergin O, Mutlu O, Martínez J, Duato J, John L (2021). QUAC-TRNG. Proceedings of the 48th Annual International Symposium on Computer Architecture, pages 944-957. Online publication date: 14-Jun-2021.
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00078
Oliveira G, Gomez-Luna J, Orosa L, Ghose S, Vijaykumar N, Fernandez I, Sadrosadati M, Mutlu O (2021). DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access, volume 9, pages 134457-134502. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3110993
Asgari B, Mukhopadhyay S, Yalamanchili S (2020). MAHASIM: Machine-Learning Hardware Acceleration Using a Software-Defined Intelligent Memory System. Journal of Signal Processing Systems. Online publication date: 28-Feb-2020.
https://doi.org/10.1007/s11265-019-01505-1
Ghose S, Boroumand A, Kim J, Gomez-Luna J, Mutlu O (2019). Processing-in-memory: A workload-driven perspective. IBM Journal of Research and Development, 63:6, pages 3:1-3:19. Online publication date: 1-Nov-2019.
https://doi.org/10.1147/JRD.2019.2934048
Jang J, Heo J, Lee Y, Won J, Kim S, Jung S, Jang H, Ham T, Lee J (2019). Charon. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 726-739. Online publication date: 12-Oct-2019.
https://dl.acm.org/doi/10.1145/3352460.3358297
Mutlu O, Ghose S, Gómez-Luna J, Ausavarungnirun R (2019). Enabling Practical Processing in and near Memory for Data-Intensive Computing. Proceedings of the 56th Annual Design Automation Conference 2019, pages 1-4. Online publication date: 2-Jun-2019.
https://dl.acm.org/doi/10.1145/3316781.3323476
Afoakwa R, Lu L, Wu H, Huang M (2019). To Stack or Not To Stack. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 110-123. Online publication date: 23-Sep-2019.
https://dl.acm.org/doi/10.1109/PACT.2019.00017
Kim J, Patel M, Hassan H, Orosa L, Mutlu O (2019). D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 582-595. Online publication date: Feb-2019.
https://doi.org/10.1109/HPCA.2019.00011