Abstract
The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a strong candidate for general-purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding the long latency of transferring data from the slow global memory. Furthermore, we show that the proposed method can reduce power and energy consumption. Reducing the access time to off-chip data plays a noticeable role in reducing waiting time and the percentage of unutilized elements. In addition, using processing elements to prefetch data during stall times bridges the memory gap in an energy-efficient manner and consequently lowers power and energy consumption. Simulation results show that we can potentially improve instructions per cycle (IPC) by up to 24.76 %. Moreover, the results show that power, energy and energy efficiency improve by up to 22.47, 24.72 and 36.01 %, respectively.
Cite this article
Falahati, H., Hessabi, S., Abdi, M. et al. Power-efficient prefetching on GPGPUs. J Supercomput 71, 2808–2829 (2015). https://doi.org/10.1007/s11227-014-1331-6
DOI: https://doi.org/10.1007/s11227-014-1331-6