
MAC: Memory Access Coalescer for 3D-Stacked Memory

Published: 05 August 2019

Abstract

Emerging data-intensive applications, such as graph analytics and data mining, exhibit irregular memory access patterns. Research has shown that, for these memory-bound applications, traditional cache-based processor architectures, which exploit locality and regular access patterns to mitigate the memory-wall problem, are inefficient. Meanwhile, novel 3D-stacked memory devices, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), promise significant increases in bandwidth that are extremely appealing for memory-bound applications. However, conventional memory interfaces designed for cache-based architectures and JEDEC DDR devices fit poorly with 3D-stacked memory, leading to significant under-utilization of the promised high bandwidth.
In response to these issues, this paper proposes MAC (Memory Access Coalescer), a coalescing unit for 3D-stacked memory. We discuss the design and implementation of MAC in the context of a custom-designed, cache-less architecture targeted at data-intensive, irregular applications. Through a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average and improves the performance of the memory system by 60.73% on average across a large set of irregular workloads.
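
The abstract describes coalescing only at a high level: outstanding requests that fall in the same region of the 3D-stacked device are merged, so that one wide transaction serves several narrow accesses. The paper's hardware design is not reproduced on this page, so the C++ sketch below is only a minimal software analogue of that idea; the Request struct, the 256-byte block size, and the coalesce() function are illustrative assumptions, not the authors' MAC implementation.

    // Minimal sketch of memory access coalescing (illustrative assumption,
    // not the authors' MAC hardware). Requests whose addresses fall in the
    // same fixed-size block are merged, so a single wide transaction to the
    // 3D-stacked device can serve all of them.
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    constexpr uint64_t kBlockBytes = 256;  // assumed coalescing granularity

    struct Request {
        uint64_t addr;  // byte address of the access
        int      src;   // id of the requesting hardware thread (illustrative)
    };

    struct CoalescedTxn {
        uint64_t block_addr;          // block-aligned base address
        std::vector<Request> merged;  // original requests served by this txn
    };

    // Group a window of outstanding requests by block address; requests that
    // share a block collapse into one transaction.
    std::vector<CoalescedTxn> coalesce(const std::vector<Request>& window) {
        std::unordered_map<uint64_t, CoalescedTxn> by_block;
        for (const Request& r : window) {
            const uint64_t base = r.addr & ~(kBlockBytes - 1);
            auto& txn =
                by_block.try_emplace(base, CoalescedTxn{base, {}}).first->second;
            txn.merged.push_back(r);
        }
        std::vector<CoalescedTxn> txns;
        txns.reserve(by_block.size());
        for (auto& kv : by_block) txns.push_back(std::move(kv.second));
        return txns;
    }

    int main() {
        // Eight requests; several touch the same 256 B block.
        const std::vector<Request> window = {
            {0x1000, 0}, {0x1040, 1}, {0x10C0, 2},  // one block at 0x1000
            {0x2000, 3}, {0x2080, 4},               // one block at 0x2000
            {0x9000, 5},                            // a block of its own
            {0xA100, 6}, {0xA180, 7},               // one block at 0xA100
        };
        const auto txns = coalesce(window);
        std::cout << window.size() << " requests -> " << txns.size()
                  << " coalesced transactions\n";
        for (const auto& t : txns)
            std::cout << std::hex << "block 0x" << t.block_addr << std::dec
                      << " serves " << t.merged.size() << " request(s)\n";
    }

The example merges eight requests into four transactions. Under one plausible (assumed) reading of coalescing efficiency, the fraction of raw requests eliminated by merging, this toy window scores (8-4)/8 = 50%, in the neighborhood of the 52.85% average the abstract reports; the real MAC, of course, operates on live memory traffic inside a cache-less pipeline rather than on a static batch.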




Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
ISBN: 9781450362955
DOI: 10.1145/3337821
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D-Stacked Memory
  2. Irregular Applications
  3. Memory Coalescing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019

Acceptance Rates

Overall Acceptance Rate: 91 of 313 submissions (29%)


Cited By

  • (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems, 44, 101047. DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec 2024.
  • (2023) A Receiver-Driven Transport Protocol With High Link Utilization Using Anti-ECN Marking in Data Center Networks. IEEE Transactions on Network and Service Management, 20(2), 1898-1912. DOI: 10.1109/TNSM.2022.3218343. Online publication date: Jun 2023.
  • (2020) Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing, 7(4), 1-24. DOI: 10.1145/3418082. Online publication date: 27 Sep 2020.
  • (2020) AMRT: Anti-ECN Marking to Improve Utilization of Receiver-driven Transmission in Data Center. Proceedings of the 49th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3404397.3404412. Online publication date: 17 Aug 2020.
  • (2019) PIMS. Proceedings of the International Symposium on Memory Systems, 41-52. DOI: 10.1145/3357526.3357550. Online publication date: 30 Sep 2019.
