DOI: 10.1145/3337821.3337867

MAC: Memory Access Coalescer for 3D-Stacked Memory

Published: 05 August 2019


Emerging data-intensive applications, such as graph analytics and data mining, exhibit irregular memory access patterns. Research has shown that traditional cache-based processor architectures, which exploit locality and regular access patterns to mitigate the memory wall, are inefficient for these memory-bound applications. Meanwhile, novel 3D-stacked memory devices, such as Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), promise significant increases in bandwidth that are highly appealing for memory-bound applications. However, conventional memory interfaces designed for cache-based architectures and JEDEC DDR devices fit poorly with 3D-stacked memory, leaving much of the promised bandwidth under-utilized.
In response to these issues, we propose MAC (Memory Access Coalescer), a coalescing unit for 3D-stacked memory. We discuss the design and implementation of MAC in the context of a custom-designed cache-less architecture targeted at data-intensive, irregular applications. Using a custom simulation infrastructure based on the RISC-V toolchain, we show that MAC achieves a coalescing efficiency of 52.85% on average and improves the performance of the memory system by 60.73% on average across a large set of irregular workloads.
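To make the idea of request coalescing concrete, the following is a minimal sketch and not the MAC design itself. It merges raw byte addresses that fall into the same aligned block into a single request. The 64-byte block size, the function names, and the efficiency metric (fraction of raw requests eliminated by merging) are all assumptions chosen for illustration; the abstract does not define how the paper measures coalescing efficiency.

```python
from collections import OrderedDict

BLOCK_BYTES = 64  # hypothetical coalescing granularity, not taken from the paper


def coalesce(addresses, block_bytes=BLOCK_BYTES):
    """Merge byte addresses that fall in the same aligned block.

    Returns a list of (block_base, [offsets]) pairs in first-touch order.
    """
    blocks = OrderedDict()
    for addr in addresses:
        base = addr - (addr % block_bytes)  # align down to the block boundary
        blocks.setdefault(base, []).append(addr - base)
    return list(blocks.items())


def coalescing_efficiency(addresses, block_bytes=BLOCK_BYTES):
    """Fraction of raw requests eliminated by merging (an assumed definition)."""
    if not addresses:
        return 0.0
    merged = coalesce(addresses, block_bytes)
    return 1.0 - len(merged) / len(addresses)


# Irregular access stream: a few addresses share a block, most do not.
stream = [0x1000, 0x1008, 0x2040, 0x1010, 0x9F00, 0x2048, 0x5000, 0x7300]
requests = coalesce(stream)
```

In this toy stream, eight raw accesses collapse into five block requests. A hardware unit would additionally have to bound the window of pending requests and preserve ordering constraints, which this sketch ignores.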






Information & Contributors


Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


  • University of Tsukuba


Association for Computing Machinery

New York, NY, United States




Author Tags

  1. 3D-Stacked Memory
  2. Irregular Applications
  3. Memory Coalescing


  • Research-article
  • Research
  • Refereed limited


ICPP 2019

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%



Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 14
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 14 Feb 2025



Cited By

  • (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems 44 (101047). DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec-2024
  • (2023) A Receiver-Driven Transport Protocol With High Link Utilization Using Anti-ECN Marking in Data Center Networks. IEEE Transactions on Network and Service Management 20:2 (1898-1912). DOI: 10.1109/TNSM.2022.3218343. Online publication date: Jun-2023
  • (2020) Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing 7:4 (1-24). DOI: 10.1145/3418082. Online publication date: 27-Sep-2020
  • (2020) AMRT: Anti-ECN Marking to Improve Utilization of Receiver-driven Transmission in Data Center. Proceedings of the 49th International Conference on Parallel Processing (1-10). DOI: 10.1145/3404397.3404412. Online publication date: 17-Aug-2020
  • (2019) PIMS. Proceedings of the International Symposium on Memory Systems (41-52). DOI: 10.1145/3357526.3357550. Online publication date: 30-Sep-2019
