Stream data prefetcher for the GPU memory interface

Published in: The Journal of Supercomputing

Abstract

Data caches are often unable to efficiently cope with the massive and simultaneous requests imposed by the SIMT execution model of modern GPUs. While software-aided cache management techniques and scheduling approaches were considered early on, efficient prefetching schemes are regarded as the most viable solution to improve the efficiency of the GPU memory subsystem. Accordingly, a new GPU prefetching mechanism is proposed, by extending the stream computing model beyond the actual GPU processing core, thus broadening it toward the memory interface. The proposed prefetcher takes advantage of the available cache management resources and combines a low-profile architecture with a dedicated pattern descriptor specification, which is used to explicitly encode each kernel's memory access pattern. The obtained results show that the proposed mechanism increases the L1 data cache hit rate by an average of 61%, resulting in performance speedups as high as 9.2× and consequent energy efficiency improvements as high as 11×.
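To give a rough intuition for what "explicitly encoding a kernel memory access pattern" can mean, the sketch below shows a minimal, hypothetical descriptor for a regular strided stream and the address sequence a prefetcher could generate from it. The structure, field names, and encoding here are illustrative assumptions, not the descriptor specification defined in the paper.

```c
#include <stdint.h>

/* Hypothetical stream pattern descriptor (illustrative only):
 * encodes a regular strided access pattern as (base, stride, count),
 * so the prefetcher can generate addresses without observing misses. */
typedef struct {
    uint64_t base;   /* base address of the stream */
    uint64_t stride; /* distance between consecutive accesses, in bytes */
    uint64_t count;  /* number of accesses in the stream */
} stream_desc_t;

/* Address that would be prefetched for the i-th access of the stream;
 * returns 0 when the index falls outside the described pattern. */
static uint64_t stream_addr(const stream_desc_t *d, uint64_t i)
{
    return (i < d->count) ? d->base + i * d->stride : 0;
}
```

Under this assumed encoding, a single descriptor covers the whole stream, which is what lets such a prefetcher run ahead of the SIMT cores instead of reacting to individual cache misses.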



Acknowledgements

This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project UID/CEC/50021/2013 and research grant SFRH/BD/100697/2014.

Author information

Correspondence to Nuno Neves.


Cite this article

Neves, N., Tomás, P. & Roma, N. Stream data prefetcher for the GPU memory interface. J Supercomput 74, 2314–2328 (2018). https://doi.org/10.1007/s11227-018-2260-6
