DOI: 10.1145/2925426.2926282
research-article

Prefetching Techniques for Near-memory Throughput Processors

Published: 01 June 2016

Abstract

Near-memory processing, or processing-in-memory (PIM), has recently regained interest as a viable way to overcome the challenges posed by the memory wall, a trend fueled mainly by the emergence of 3D-stacked memories. GPUs are touted as strong candidates for in-memory processors because of their superior bandwidth-utilization capabilities. Although placing a GPU core beneath memory exposes it to unprecedented memory bandwidth, in this paper we demonstrate that significant opportunities remain to improve the performance of the simpler, in-memory GPU processors (GPU-PIM) by improving their memory performance. We therefore propose three lightweight, practical memory-side prefetchers for GPU-PIM systems. The proposed prefetchers exploit the patterns in individual memory accesses and the synergy in wavefront-localized memory streams, combined with a better understanding of the memory-system state, to prefetch from DRAM row buffers into on-chip prefetch buffers, achieving over 75% prefetcher accuracy and a 40% improvement in row-buffer locality. To maximize the utilization of prefetched data and minimize thrashing, the prefetchers also use a novel prefetch-buffer management policy based on a unique dead-row prediction mechanism, together with an eviction-based prefetch-trigger policy that controls their aggressiveness. The proposed prefetchers improve performance by over 60% (maximum) and 9% on average over the baseline, while achieving over 33% of the performance benefit of a perfect L2 using less than 5.6KB of additional hardware. They also outperform the state-of-the-art memory-side prefetcher, OWL, by more than 20%.
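The core mechanism the abstract describes, prefetching lines from an open DRAM row buffer into a small on-chip prefetch buffer and consuming them on later accesses, can be illustrated with a minimal sketch. This is a simplified illustration under stated assumptions, not the paper's design: the class name, the line and row sizes, and the FIFO replacement policy are all hypothetical, and the paper's dead-row prediction and eviction-based trigger are reduced here to a simple next-line policy confined to the currently open row.

```python
from collections import OrderedDict

LINE_BYTES = 128   # assumed cache-line size
ROW_BYTES = 2048   # assumed DRAM row-buffer size

class RowBufferPrefetcher:
    """Toy memory-side prefetcher: after each access, prefetch the next
    line of the same DRAM row into a small FIFO prefetch buffer, since a
    row-buffer read is cheap only while that row is open."""

    def __init__(self, capacity=8):
        self.buffer = OrderedDict()   # line address -> present
        self.capacity = capacity
        self.hits = 0                 # demand accesses served from buffer
        self.dram_accesses = 0        # demand accesses that went to DRAM

    def access(self, addr):
        line = (addr // LINE_BYTES) * LINE_BYTES
        if line in self.buffer:
            self.hits += 1
            del self.buffer[line]     # consumed: free the slot
        else:
            self.dram_accesses += 1   # demand miss opens the row
        self._prefetch_next(line)
        return line

    def _prefetch_next(self, line):
        nxt = line + LINE_BYTES
        # never cross a row boundary: the next row is not open
        if nxt // ROW_BYTES != line // ROW_BYTES:
            return
        if nxt in self.buffer:
            return
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)   # evict oldest entry (FIFO)
        self.buffer[nxt] = True

# A sequential stream within one row: only the first access pays a
# full DRAM access; the rest hit in the prefetch buffer.
p = RowBufferPrefetcher()
for a in range(0, 1024, 128):
    p.access(a)
print(p.hits, p.dram_accesses)  # 7 1
```

A streaming wavefront-local access pattern like this is the best case for such a scheme; the paper's contribution lies in keeping accuracy high (and thrashing low) on less regular GPU streams, which this sketch does not attempt to model.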

References

[1] Advancing Moore's Law in 2014. http://www.intel.com/content/dam/www/public/us/en/documents/presentation/advancing-moores-law-in-2014-presentation.pdf.
[2] NVIDIA's next generation CUDA compute architecture, Fermi, 2009.
[3] NVIDIA. CUDA C/C++ SDK code samples, 2011.
[4] Hybrid Memory Cube Consortium. Hybrid Memory Cube Specification 1.0, 2013.
[5] JEDEC Standard JESD235. High Bandwidth Memory (HBM) DRAM, 2013.
[6] JEDEC Standard JESD235A. High Bandwidth Memory (HBM) 2 DRAM, 2016.
[7] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 105--117, New York, NY, USA, 2015. ACM.
[8] J. Ahn, S. Yoo, O. Mutlu, and K. Choi. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 336--348, New York, NY, USA, 2015. ACM.
[9] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors. In PPoPP, pages 23--34, New York, NY, USA, 2012. ACM.
[10] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, pages 163--174. IEEE Computer Society, 2009.
[11] P. Balaprakash, D. Buntinas, A. Chan, A. Guha, R. Gupta, S. H. K. Narayanan, A. A. Chien, P. Hovland, and B. Norris. Exascale workload characterization and architecture implications. In Proceedings of the High Performance Computing Symposium, HPC '13, pages 5:1--5:8, San Diego, CA, USA, 2013. Society for Computer Simulation International.
[12] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C. C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 70--79, Jan. 1999.
[13] K. Chandrasekar, C. Weis, B. Akesson, N. Wehn, and K. Goossens. System and circuit level power modeling of energy-efficient 3D-stacked Wide I/O DRAMs. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 236--241, San Jose, CA, USA, 2013. EDA Consortium.
[14] K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens. DRAMPower: Open-source DRAM power and energy estimation tool.
[15] D. Chang, G. Byun, H. Kim, M. Ahn, S. Ryu, N. Kim, and M. Schulte. Reevaluating the latency claims of 3D stacked memories. In 18th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 657--662, Jan. 2013.
[16] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia: Understanding irregular GPGPU graph applications. In IISWC, pages 185--195. IEEE Computer Society, 2013.
[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC '09, pages 44--54, Washington, DC, USA, 2009. IEEE Computer Society.
[18] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In GPGPU-3, pages 63--74, New York, NY, USA, 2010. ACM.
[19] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca. The architecture of the DIVA processing-in-memory chip. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 14--25, New York, NY, USA, 2002. ACM.
[20] E. Ebrahimi, O. Mutlu, and Y. N. Patt. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In HPCA, pages 7--17. IEEE Computer Society, 2009.
[21] Y. Eckert, N. Jayasena, and G. H. Loh. Thermal feasibility of die-stacked processing in memory. In WoNDP: 2nd Workshop on Near-Data Processing, 2014.
[22] Z. Fang, L. Zhang, J. B. Carter, A. Ibrahim, and M. A. Parker. Active memory operations. In Proceedings of the 21st Annual International Conference on Supercomputing, pages 232--241, New York, NY, USA, 2007. ACM.
[23] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 283--295, Feb. 2015.
[24] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the memory system: Predicting and optimizing memory behavior. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 209--220, Washington, DC, USA, 2002. IEEE Computer Society.
[25] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 332--343, New York, NY, USA, 2013. ACM.
[26] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, et al. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS, 2013.
[27] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 252--263, New York, NY, USA, 1997. ACM.
[28] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pages 157--166, 2013.
[29] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO, pages 175--186, Washington, DC, USA, 2010. IEEE Computer Society.
[30] A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pages 144--154, New York, NY, USA, 2001. ACM.
[31] N. B. Lakshminarayana and H. Kim. Spare register aware prefetching for graph algorithms on GPUs. In 20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, pages 614--625, Orlando, FL, USA, Feb. 2014.
[32] J. Lee, N. B. Lakshminarayana, H. Kim, and R. W. Vuduc. Many-thread aware prefetching mechanisms for GPGPU applications. In MICRO, pages 213--224. IEEE Computer Society, 2010.
[33] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, pages 487--498, New York, NY, USA, 2013. ACM.
[34] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou. Locality-driven dynamic GPU cache bypassing. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS '15, pages 67--77, New York, NY, USA, 2015. ACM.
[35] D. Li, M. Rhu, D. R. Johnson, M. O'Connor, M. Erez, D. Burger, D. S. Fussell, and S. W. Keckler. Priority-based cache allocation in throughput processors. In HPCA, pages 89--100. IEEE, 2015.
[36] K.-N. Lim, W.-J. Jang, H.-S. Won, K.-Y. Lee, H. Kim, D.-W. Kim, M.-H. Cho, S.-L. Kim, J.-H. Kang, K.-W. Park, and B.-T. Jeong. A 1.2V 23nm 6F2 4Gb DDR3 SDRAM with local-bitline sense amplifier, hybrid LIO sense amplifier and dummy-less array architecture. In ISSCC, pages 42--44. IEEE, 2012.
[37] H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 222--233, 2008.
[38] G. H. Loh, N. Jayasena, M. H. Oskin, M. Nutter, D. Roberts, M. Meswani, D. P. Zhang, and M. Ignatowski. A processing-in-memory taxonomy and a case for studying fixed-function PIM. In WoNDP: 1st Workshop on Near-Data Processing, 2013.
[39] S. Mu, Y. Deng, Y. Chen, H. Li, J. Pan, W. Zhang, and Z. Wang. Orchestrating cache management and memory scheduling for GPGPU applications. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(8):1803--1814, Aug. 2014.
[40] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. In Proceedings of the 10th International Symposium on High Performance Computer Architecture, pages 96--, Washington, DC, USA, 2004. IEEE Computer Society.
[41] M. Oskin, F. T. Chong, and T. Sherwood. Active Pages: A computation model for intelligent memory. In ISCA, pages 192--203, Washington, DC, USA, 1998. IEEE Computer Society.
[42] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A case for intelligent RAM. IEEE Micro, 17(2):34--44, Mar. 1997.
[43] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software, pages 190--200, 2014.
[44] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke. APOGEE: Adaptive prefetching on GPUs for energy efficiency. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 73--82, Piscataway, NJ, USA, 2013. IEEE Press.
[45] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial memory streaming. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 252--263, Washington, DC, USA, 2006. IEEE Computer Society.
[46] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 63--74, Washington, DC, USA, 2007.
[47] J. Torrellas. FlexRAM: Toward an advanced intelligent memory system: A retrospective paper. In ICCD, pages 3--4. IEEE Computer Society, 2012.
[48] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In ISPASS, pages 235--246. IEEE Computer Society, 2010.
[49] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 86--97, New York, NY, USA, 2010. ACM.
[50] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski. TOP-PIM: Throughput-oriented programmable processing in memory. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pages 85--98, New York, NY, USA, 2014.
[51] D. P. Zhang, N. Jayasena, A. Lyashevsky, J. Greathouse, M. Meswani, M. Nutter, and M. Ignatowski. A new perspective on processing-in-memory architecture design. In Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC '13, pages 7:1--7:3, New York, NY, USA, 2013. ACM.


    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN: 9781450343619
    DOI: 10.1145/2925426

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D die-stacked memory
    2. GPU
    3. Prefetching
    4. Processing-in-memory

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '16

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%


    Cited By

    • (2023) Snake: A Variable-length Chain-based Prefetching for GPUs. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 728--741, Oct. 2023. DOI: 10.1145/3613424.3623782
    • (2022) A Lightweight and Efficient GPU for NDP Utilizing Data Access Pattern of Image Processing. IEEE Transactions on Computers, 71(1):13--26, Jan. 2022. DOI: 10.1109/TC.2020.3035826
    • (2021) Memory-Side Prefetching Scheme Incorporating Dynamic Page Mode in 3D-Stacked DRAM. IEEE Transactions on Parallel and Distributed Systems, 32(11):2734--2747, Nov. 2021. DOI: 10.1109/TPDS.2020.3044856
    • (2020) Off-Chip Congestion Management for GPU-based Non-Uniform Processing-in-Memory Networks. In 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 282--289, Mar. 2020. DOI: 10.1109/PDP50117.2020.00050
    • (2019) To Stack or Not To Stack. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 110--123, Sept. 2019. DOI: 10.1109/PACT.2019.00017
    • (2018) CAMPS. In Proceedings of the 47th International Conference on Parallel Processing, pages 1--9, Aug. 2018. DOI: 10.1145/3225058.3225112
    • (2018) Stream data prefetcher for the GPU memory interface. The Journal of Supercomputing, 74(6):2314--2328, June 2018. DOI: 10.1007/s11227-018-2260-6
    • (2017) Lightweight SIMT core designs for intelligent 3D stacked DRAM. In Proceedings of the International Symposium on Memory Systems, pages 49--59, Oct. 2017. DOI: 10.1145/3132402.3132426
    • (2017) Statistical Pattern Based Modeling of GPU Memory Access Streams. In Proceedings of the 54th Annual Design Automation Conference, pages 1--6, June 2017. DOI: 10.1145/3061639.3062320
    • (2017) Last Level Collective Hardware Prefetching For Data-Parallel Applications. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 72--83, Dec. 2017. DOI: 10.1109/HiPC.2017.00018
