research-article

Memory Coalescing for Hybrid Memory Cube

Authors:

John D. Leidel,

Yong Chen

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

Article No.: 62, Pages 1 - 10

https://doi.org/10.1145/3225058.3225062

Published: 13 August 2018

Abstract

Many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when they exhibit non-contiguous, irregular, and small memory access patterns. Long memory access latency can dramatically slow overall application performance. The growing demand for high memory bandwidth and low-latency access has stimulated the advent of novel 3D-stacked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth than conventional JEDEC DDR devices. Although many existing studies have been devoted to achieving high bandwidth throughput on HMC, its bandwidth potential cannot be fully exploited without a highly efficient memory coalescing and interfacing methodology for HMC devices. In this research, we introduce a novel memory coalescer methodology that improves memory bandwidth efficiency and overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% of memory accesses to HMC and improves overall performance by 13.14% on average.
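As a rough illustration of what request coalescing means in general, the sketch below merges small requests that fall within the same aligned block into a single larger request. It is not the coalescer described in the paper: the function names (`coalesce`, `access_reduction`), the 8-byte request size, and the 64-byte block size are all assumptions chosen only for the example.

```python
from collections import OrderedDict


def coalesce(addresses, request_size=8, block_size=64):
    """Merge small requests that target the same aligned block.

    Assumptions (illustrative only, not taken from the paper):
    every address is aligned to request_size, and block_size is a
    multiple of request_size, so no request straddles two blocks.
    Returns one (block_base, bytes_requested) pair per merged request.
    """
    blocks = OrderedDict()
    for addr in addresses:
        base = addr - (addr % block_size)  # aligned start of the block
        blocks.setdefault(base, set()).add(addr)  # duplicates collapse
    return [(base, len(addrs) * request_size) for base, addrs in blocks.items()]


def access_reduction(addresses, request_size=8, block_size=64):
    """Fraction of memory requests eliminated by coalescing."""
    merged = coalesce(addresses, request_size, block_size)
    return 1 - len(merged) / len(addresses)


# Seven small requests; four land in block 0 and two in block 128.
raw = [0, 8, 16, 128, 136, 4096, 24]
print(coalesce(raw))          # one entry per merged request
print(access_reduction(raw))  # share of requests removed
```

In this toy trace, seven raw requests collapse into three merged ones. A hardware coalescer would additionally have to bound how long requests wait to be merged and respect the packet sizes the memory device supports, which this sketch ignores.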

Cited By

Pandey S, Venkatesh T (2022). Performance investigation of packet-based communication in 3D-memories. The Journal of Supercomputing 78:17 (19070-19096). DOI: 10.1007/s11227-022-04605-1. Online publication date: 1-Nov-2022.
https://dl.acm.org/doi/10.1007/s11227-022-04605-1
Leidel J, Wang X, Williams B, Chen Y (2020). Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing 7:4 (1-24). DOI: 10.1145/3418082. Online publication date: 27-Sep-2020.
https://dl.acm.org/doi/10.1145/3418082
Wang X, Tumeo A, Leidel J, Li J, Chen Y (2019). MAC. Proceedings of the 48th International Conference on Parallel Processing (1-10). DOI: 10.1145/3337821.3337867. Online publication date: 5-Aug-2019.
https://dl.acm.org/doi/10.1145/3337821.3337867

Index Terms

Memory Coalescing for Hybrid Memory Cube
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware
  1. Emerging technologies
    1. Emerging interfaces
    2. Memory and dense storage

Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

August 2018

945 pages

ISBN: 9781450365109

DOI: 10.1145/3225058

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP 2018: 47th International Conference on Parallel Processing

August 13 - 16, 2018

Eugene, OR, USA

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;

Overall Acceptance Rate 91 of 313 submissions, 29%

Article Metrics

Total Citations: 3
Total Downloads: 162
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 2

Reflects downloads up to 14 Feb 2025
