DOI: 10.1145/2925426.2926282
research-article

Prefetching Techniques for Near-memory Throughput Processors

Published: 01 June 2016

Abstract

Near-memory processing, or processing-in-memory (PIM), has recently regained interest as a viable way to overcome the challenges posed by the memory wall, a trend fueled mainly by the emergence of 3D-stacked memories. GPUs are touted as strong candidates for in-memory processors because of their superior bandwidth-utilization capabilities. Although placing a GPU core beneath memory exposes it to unprecedented memory bandwidth, in this paper we demonstrate that significant opportunities remain to improve the performance of the simpler, in-memory GPU processors (GPU-PIM) by improving their memory performance. We therefore propose three lightweight, practical memory-side prefetchers for GPU-PIM systems. The proposed prefetchers exploit the patterns in individual memory accesses and the synergy in wavefront-localized memory streams, combined with a better understanding of the memory-system state, to prefetch from DRAM row buffers into on-chip prefetch buffers, achieving over 75% prefetcher accuracy and a 40% improvement in row-buffer locality. To maximize the utilization of prefetched data and minimize thrashing, the prefetchers also use a novel prefetch-buffer management policy based on a unique dead-row prediction mechanism, together with an eviction-based prefetch-trigger policy that controls their aggressiveness. The proposed prefetchers improve performance by over 60% (maximum) and 9% on average over the baseline, while achieving over 33% of the performance benefit of a perfect L2 using less than 5.6KB of additional hardware. They also outperform the state-of-the-art memory-side prefetcher, OWL, by more than 20%.
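The core mechanism the abstract describes, prefetching lines from an open DRAM row buffer into a small on-chip prefetch buffer and consuming them on later accesses, can be illustrated with a minimal sketch. This is a simplified illustration under stated assumptions, not the paper's design: the class name, the line and row sizes, and the FIFO replacement policy are all hypothetical, and the paper's dead-row prediction and eviction-based trigger are reduced here to a simple next-line policy confined to the currently open row.

```python
from collections import OrderedDict

LINE_BYTES = 128   # assumed cache-line size
ROW_BYTES = 2048   # assumed DRAM row-buffer size

class RowBufferPrefetcher:
    """Toy memory-side prefetcher: after each access, prefetch the next
    line of the same DRAM row into a small FIFO prefetch buffer, since a
    row-buffer read is cheap only while that row is open."""

    def __init__(self, capacity=8):
        self.buffer = OrderedDict()   # line address -> present
        self.capacity = capacity
        self.hits = 0                 # demand accesses served from buffer
        self.dram_accesses = 0        # demand accesses that went to DRAM

    def access(self, addr):
        line = (addr // LINE_BYTES) * LINE_BYTES
        if line in self.buffer:
            self.hits += 1
            del self.buffer[line]     # consumed: free the slot
        else:
            self.dram_accesses += 1   # demand miss opens the row
        self._prefetch_next(line)
        return line

    def _prefetch_next(self, line):
        nxt = line + LINE_BYTES
        # never cross a row boundary: the next row is not open
        if nxt // ROW_BYTES != line // ROW_BYTES:
            return
        if nxt in self.buffer:
            return
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)   # evict oldest entry (FIFO)
        self.buffer[nxt] = True

# A sequential stream within one row: only the first access pays a
# full DRAM access; the rest hit in the prefetch buffer.
p = RowBufferPrefetcher()
for a in range(0, 1024, 128):
    p.access(a)
print(p.hits, p.dram_accesses)  # 7 1
```

A streaming wavefront-local access pattern like this is the best case for such a scheme; the paper's contribution lies in keeping accuracy high (and thrashing low) on less regular GPU streams, which this sketch does not attempt to model.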

References

[1] Advancing Moore's Law in 2014. http://www.intel.com/content/dam/www/public/us/en/documents/presentation/advancing-moores-law-in-2014-presentation.pdf.
[2] NVIDIA's next generation CUDA compute architecture, Fermi, 2009.
[3] NVIDIA. CUDA C/C++ SDK code samples, 2011.
[4] Hybrid Memory Cube Consortium. Hybrid Memory Cube Specification 1.0, 2013.
[5] JEDEC Standard JESD235. High Bandwidth Memory (HBM) DRAM, 2013.
[6] JEDEC Standard JESD235A. High Bandwidth Memory (HBM) 2 DRAM, 2016.
[7] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 105--117, New York, NY, USA, 2015. ACM.
[8] J. Ahn, S. Yoo, O. Mutlu, and K. Choi. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 336--348, New York, NY, USA, 2015. ACM.
[9] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors. In PPoPP, pages 23--34, New York, NY, USA, 2012. ACM.
[10] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, pages 163--174. IEEE Computer Society, 2009.
[11] P. Balaprakash, D. Buntinas, A. Chan, A. Guha, R. Gupta, S. H. K. Narayanan, A. A. Chien, P. Hovland, and B. Norris. Exascale workload characterization and architecture implications. In Proceedings of the High Performance Computing Symposium, HPC '13, pages 5:1--5:8, San Diego, CA, USA, 2013. Society for Computer Simulation International.
[12] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C. C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 70--79, Jan. 1999.
[13] K. Chandrasekar, C. Weis, B. Akesson, N. Wehn, and K. Goossens. System and circuit level power modeling of energy-efficient 3D-stacked Wide I/O DRAMs. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 236--241, San Jose, CA, USA, 2013. EDA Consortium.
[14] K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens. DRAMPower: Open-source DRAM power and energy estimation tool.
[15] D. Chang, G. Byun, H. Kim, M. Ahn, S. Ryu, N. Kim, and M. Schulte. Reevaluating the latency claims of 3D stacked memories. In 18th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 657--662, Jan. 2013.
[16] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia: Understanding irregular GPGPU graph applications. In IISWC, pages 185--195. IEEE Computer Society, 2013.
[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC '09, pages 44--54, Washington, DC, USA, 2009. IEEE Computer Society.
[18] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In GPGPU-3, pages 63--74, New York, NY, USA, 2010. ACM.
[19] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca. The architecture of the DIVA processing-in-memory chip. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 14--25, New York, NY, USA, 2002. ACM.
[20] E. Ebrahimi, O. Mutlu, and Y. N. Patt. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In HPCA, pages 7--17. IEEE Computer Society, 2009.
[21] Y. Eckert, N. Jayasena, and G. H. Loh. Thermal feasibility of die-stacked processing in memory. In WoNDP: 2nd Workshop on Near-Data Processing, 2014.
[22] Z. Fang, L. Zhang, J. B. Carter, A. Ibrahim, and M. A. Parker. Active memory operations. In Proceedings of the 21st Annual International Conference on Supercomputing, pages 232--241, New York, NY, USA, 2007. ACM.
[23] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 283--295, Feb. 2015.
[24] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the memory system: Predicting and optimizing memory behavior. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 209--220, Washington, DC, USA, 2002. IEEE Computer Society.
[25] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 332--343, New York, NY, USA, 2013. ACM.
[26] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, et al. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS, 2013.
[27] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 252--263, New York, NY, USA, 1997. ACM.
[28] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pages 157--166, 2013.
[29] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO, pages 175--186, Washington, DC, USA, 2010. IEEE Computer Society.
[30] A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pages 144--154, New York, NY, USA, 2001. ACM.
[31] N. B. Lakshminarayana and H. Kim. Spare register aware prefetching for graph algorithms on GPUs. In 20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, pages 614--625, Orlando, FL, USA, Feb. 2014.
[32] J. Lee, N. B. Lakshminarayana, H. Kim, and R. W. Vuduc. Many-thread aware prefetching mechanisms for GPGPU applications. In MICRO, pages 213--224. IEEE Computer Society, 2010.
[33] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, pages 487--498, New York, NY, USA, 2013. ACM.
[34] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou. Locality-driven dynamic GPU cache bypassing. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, pages 67--77, New York, NY, USA, 2015. ACM.
[35] D. Li, M. Rhu, D. R. Johnson, M. O'Connor, M. Erez, D. Burger, D. S. Fussell, and S. W. Keckler. Priority-based cache allocation in throughput processors. In HPCA, pages 89--100. IEEE, 2015.
[36] K.-N. Lim, W.-J. Jang, H.-S. Won, K.-Y. Lee, H. Kim, D.-W. Kim, M.-H. Cho, S.-L. Kim, J.-H. Kang, K.-W. Park, and B.-T. Jeong. A 1.2V 23nm 6F2 4Gb DDR3 SDRAM with local-bitline sense amplifier, hybrid LIO sense amplifier and dummy-less array architecture. In ISSCC, pages 42--44. IEEE, 2012.
[37] H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 222--233, 2008.
[38] G. H. Loh, N. Jayasena, M. H. Oskin, M. Nutter, D. Roberts, M. Meswani, D. P. Zhang, and M. Ignatowski. A processing-in-memory taxonomy and a case for studying fixed-function PIM. In WoNDP: 1st Workshop on Near-Data Processing, 2013.
[39] S. Mu, Y. Deng, Y. Chen, H. Li, J. Pan, W. Zhang, and Z. Wang. Orchestrating cache management and memory scheduling for GPGPU applications. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(8):1803--1814, Aug. 2014.
[40] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In Proceedings of the 10th International Symposium on High Performance Computer Architecture, pages 96--, Washington, DC, USA, 2004. IEEE Computer Society.
[41] M. Oskin, F. T. Chong, and T. Sherwood. Active Pages: A computation model for intelligent memory. In ISCA, pages 192--203, Washington, DC, USA, 1998. IEEE Computer Society.
[42] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A case for intelligent RAM. IEEE Micro, 17(2):34--44, Mar. 1997.
[43] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software, pages 190--200, 2014.
[44] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke. APOGEE: Adaptive prefetching on GPUs for energy efficiency. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 73--82, Piscataway, NJ, USA, 2013. IEEE Press.
[45] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial memory streaming. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 252--263, Washington, DC, USA, 2006. IEEE Computer Society.
[46] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 63--74, Washington, DC, USA, 2007.
[47] J. Torrellas. FlexRAM: Toward an advanced intelligent memory system: A retrospective paper. In ICCD, pages 3--4. IEEE Computer Society, 2012.
[48] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In ISPASS, pages 235--246. IEEE Computer Society, 2010.
[49] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 86--97, New York, NY, USA, 2010. ACM.
[50] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski. TOP-PIM: Throughput-oriented programmable processing in memory. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pages 85--98, New York, NY, USA, 2014.
[51] D. P. Zhang, N. Jayasena, A. Lyashevsky, J. Greathouse, M. Meswani, M. Nutter, and M. Ignatowski. A new perspective on processing-in-memory architecture design. In Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC '13, pages 7:1--7:3, New York, NY, USA, 2013. ACM.


    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN: 9781450343619
    DOI: 10.1145/2925426

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D die-stacked memory
    2. GPU
    3. Prefetching
    4. Processing-in-memory

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '16

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%


    Cited By

    • (2023) Snake: A Variable-length Chain-based Prefetching for GPUs. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 728--741, Oct. 2023. DOI: 10.1145/3613424.3623782
    • (2022) A Lightweight and Efficient GPU for NDP Utilizing Data Access Pattern of Image Processing. IEEE Transactions on Computers, 71(1):13--26, Jan. 2022. DOI: 10.1109/TC.2020.3035826
    • (2021) Memory-Side Prefetching Scheme Incorporating Dynamic Page Mode in 3D-Stacked DRAM. IEEE Transactions on Parallel and Distributed Systems, 32(11):2734--2747, Nov. 2021. DOI: 10.1109/TPDS.2020.3044856
    • (2020) Off-Chip Congestion Management for GPU-based Non-Uniform Processing-in-Memory Networks. In 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 282--289, Mar. 2020. DOI: 10.1109/PDP50117.2020.00050
    • (2019) To Stack or Not To Stack. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 110--123, Sept. 2019. DOI: 10.1109/PACT.2019.00017
    • (2018) CAMPS. In Proceedings of the 47th International Conference on Parallel Processing, pages 1--9, Aug. 2018. DOI: 10.1145/3225058.3225112
    • (2018) Stream data prefetcher for the GPU memory interface. The Journal of Supercomputing, 74(6):2314--2328, June 2018. DOI: 10.1007/s11227-018-2260-6
    • (2017) Lightweight SIMT core designs for intelligent 3D stacked DRAM. In Proceedings of the International Symposium on Memory Systems, pages 49--59, Oct. 2017. DOI: 10.1145/3132402.3132426
    • (2017) Statistical Pattern Based Modeling of GPU Memory Access Streams. In Proceedings of the 54th Annual Design Automation Conference, pages 1--6, June 2017. DOI: 10.1145/3061639.3062320
    • (2017) Last Level Collective Hardware Prefetching For Data-Parallel Applications. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 72--83, Dec. 2017. DOI: 10.1109/HiPC.2017.00018
