Power-efficient prefetching on GPGPUs

Published: 2015, The Journal of Supercomputing

Abstract

The graphics processing unit (GPU) is a promising platform for continued growth in peak processing speed, offering low latency and high throughput. The highly programmable, multithreaded nature of GPUs makes them an attractive candidate for general-purpose computing. However, supporting non-graphics workloads on graphics processors raises several architectural challenges. In this paper, we focus on improving performance by better hiding the long latency of transfers from slow global memory. We further show that the proposed method reduces power and energy. Shortening access time to off-chip data plays a noticeable role in reducing stall time and the fraction of idle processing elements. Moreover, using idle processing elements to prefetch data during stall periods bridges the memory gap in an energy-efficient manner and consequently lowers power and energy consumption. Simulation results show that the proposed method improves instructions per cycle (IPC) by up to 24.76%. Power, energy, and energy efficiency improve by up to 22.47%, 24.72%, and 36.01%, respectively.
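The benefit of overlapping prefetches with otherwise-idle stall cycles can be illustrated with a toy timing model. This is only a back-of-the-envelope sketch with made-up cycle counts, not the paper's simulator or measured numbers: without prefetching, each data tile pays the full global-memory latency before compute begins; with prefetching, the next tile's fetch overlaps the current tile's compute.

```python
# Toy timing model of latency hiding via prefetching.
# MEM_LATENCY and COMPUTE are illustrative values, not from the paper.

MEM_LATENCY = 400   # cycles to fetch one tile from global memory (assumed)
COMPUTE = 300       # cycles to process one on-chip tile (assumed)

def cycles_no_prefetch(n_tiles):
    # Each tile: stall for the full fetch, then compute.
    return n_tiles * (MEM_LATENCY + COMPUTE)

def cycles_with_prefetch(n_tiles):
    # Fetch tile 0 up front; each later fetch overlaps the previous
    # tile's compute, so each step costs max(fetch, compute).
    return MEM_LATENCY + n_tiles * max(MEM_LATENCY, COMPUTE)

if __name__ == "__main__":
    base = cycles_no_prefetch(8)
    pref = cycles_with_prefetch(8)
    print(f"no prefetch:   {base} cycles")    # 5600
    print(f"with prefetch: {pref} cycles")    # 3600
    print(f"speedup: {base / pref:.2f}x")     # 1.56x
```

The model shows why the gain saturates once fetch latency exceeds compute time per tile: overlap can hide at most the smaller of the two costs, which is the memory gap the paper targets.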




Corresponding author

Correspondence to Hajar Falahati.


Cite this article

Falahati, H., Hessabi, S., Abdi, M. et al. Power-efficient prefetching on GPGPUs. J Supercomput 71, 2808–2829 (2015). https://doi.org/10.1007/s11227-014-1331-6
