DOI: 10.1145/3225058.3225062
research-article

Memory Coalescing for Hybrid Memory Cube

Published: 13 August 2018

Abstract

Many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when they exhibit non-contiguous, irregular, and small memory access patterns. Long memory access latency can dramatically degrade overall application performance. The growing demand for high memory bandwidth and low-latency access has stimulated the advent of novel 3D-stacked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth than conventional JEDEC DDR devices. Although many existing studies have been devoted to achieving high bandwidth on HMC, its bandwidth potential cannot be fully exploited because a highly efficient memory coalescing and interfacing methodology for HMC devices is lacking. In this research, we introduce a novel memory coalescer methodology that improves memory bandwidth efficiency and overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% of memory accesses to HMC and improves overall performance by 13.14% on average.
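The abstract does not describe the coalescer's internals, so the following is only a minimal sketch of the general idea of request coalescing, not the paper's design. It merges small read requests that fall in the same aligned block into one larger request. The function name `coalesce`, the 64-byte block size, and the 16-byte request granularity are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def coalesce(requests, block_bytes=64, flit_bytes=16):
    """Merge (address, size) read requests that fall in the same aligned block.

    Hypothetical helper for illustration only. Returns one (address, size)
    request per touched block, covering the smallest span that contains the
    original requests, rounded out to a multiple of flit_bytes.
    block_bytes is assumed to be a multiple of flit_bytes.
    """
    spans = OrderedDict()  # block base address -> (lowest byte, highest byte + 1)
    for addr, size in requests:
        end = addr + size
        # Split a request that crosses a block boundary into per-block pieces.
        while addr < end:
            base = addr - addr % block_bytes
            piece_end = min(end, base + block_bytes)
            lo, hi = spans.get(base, (addr, piece_end))
            spans[base] = (min(lo, addr), max(hi, piece_end))
            addr = piece_end

    merged = []
    for lo, hi in spans.values():
        lo -= lo % flit_bytes   # round the start down to the granularity
        hi += -hi % flit_bytes  # round the end up to the granularity
        merged.append((lo, hi - lo))
    return merged


# Five small, partly scattered reads collapse into two larger requests.
raw = [(0x1000, 8), (0x1008, 8), (0x1030, 4), (0x2004, 8), (0x1010, 8)]
merged = coalesce(raw)
```

In this sketch the merged request for a block may fetch bytes that no original request asked for, so it trades some extra transferred data for fewer requests. A real coalescer would also have to route the returned data back to each original requester, which is omitted here.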


Cited By

  • (2022) Performance investigation of packet-based communication in 3D-memories. The Journal of Supercomputing 78:17 (19070-19096). DOI: 10.1007/s11227-022-04605-1. Online publication date: 1-Nov-2022
  • (2020) Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing 7:4 (1-24). DOI: 10.1145/3418082. Online publication date: 27-Sep-2020
  • (2019) MAC. Proceedings of the 48th International Conference on Parallel Processing (1-10). DOI: 10.1145/3337821.3337867. Online publication date: 5-Aug-2019


Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Cache
  2. Hybrid Memory Cube
  3. MSHR
  4. Memory Coalescing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;
Overall Acceptance Rate 91 of 313 submissions, 29%


