Power-efficient prefetching on GPGPUs

Falahati, Hajar; Hessabi, Shaahin; Abdi, Mania; Baniasadi, Amirali

doi:10.1007/s11227-014-1331-6

Power-efficient prefetching on GPGPUs

Published: 05 December 2014

Volume 71, pages 2808–2829, (2015)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Hajar Falahati¹,
Shaahin Hessabi¹,
Mania Abdi¹ &
…
Amirali Baniasadi²

335 Accesses
6 Citations
Explore all metrics

Abstract

The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding long waiting time for transferring data from the slow global memory. Furthermore, we show that the proposed method can reduce power and energy. Reduction in access time to off-chip data has a noticeable role in reducing waiting time and the percentage of unutilized elements. Also, using processing elements in a suitable manner to prefetch data during stall times bridges the memory gap in an energy-efficient manner, and consequently leads to less power and energy consumption. Simulation results show that we can potentially improve instruction per cycle (IPC) up to 24.76 %. Moreover, results show that power, energy and energy efficiency improve by up to 22.47, 24.72 and 36.01 %, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Keckler SW, Olukotun L, Hofstee HP (2009) Multicore processors and systems. Springer, New York
Book MATH Google Scholar
ITRS (2008) Update. http://www.itrs.net/Links/2008ITRS/Home2008.htm
Agarwal V, Hrishikesh MS, Keckler SW, Burger D (2000) Clock rate versus IPC: the end of the road for conventional microarchitectures. In: Proceedings of the 27th annual international symposium on computer architecture (ISCA ’00), pp 248–259
Amodt TM (2009) Architecting graphics processors for non-graphics compute acceleration. In: IEEE Pacific Rim conference on communications, computers and signal processing, Victoria, BC, 23–26 August 2009, pp 963–968
Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC (2008) GPU computing graphics: processing units-powerful, programmable, and highly parallel-are increasingly targeting general-purpose computing applications. Proc IEEE 96(5):879–899
Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn AE, Purcell TJ (2005) A survey of general-purpose computation on graphics hardware. In: Proceedings of EUROGRAPHICS 2005, pp 21–51
NVIDIA. http://www.nvidia.com/object/what-is-gpu-computing.html
Hong S, Kim H (2009) An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In: Proceedings of the 36th annual international symposium on computer architecture (ISCA ’09), pp 152–163
Gou C, Gaydadjiev GN (2011) Elastic pipeline: addressing GPU on-chip shared memory bank conflicts. In: Proceedings of the 8th ACM international conference on computing frontiers (CF ’11)
Bakhoda A, Yuan G, Fung W, Wong H, Aamodt T (2009) Analyzing CUDA workloads using a detailed GPU simulator. In: IEEE international symposium on performance analysis of systems and software, ISPASS 2009, Boston, MA, 26–28 April 2009, pp 163–174
Hong S, Kim H (2010) An integrated GPU power and performance model. In: Proceedings of the 37th annual international symposium on computer architecture (ISCA ’10), 280–289
Tarjan D, Skadron K (2010) The sharing tracker: using ideas from cache coherence hardware to reduce off-chip memory traffic with non-coherent caches. In: International conference for high performance computing, networking, storage and analysis (SC), New Orleans, LA, 13–19 November 2010, pp 1–10
Scogland TRW, Lin H, Feng W (2010) A first look at integrated GPUs for green high-performance computing. Comput Sci Res Dev 25:125–134
Wang PH, Chen YM, Yang CL, Cheng YJ (2009) A predictive shutdown technique for GPU shader processors. IEEE Comput Archit Lett 8(1):9–12
Gebhart M, Keckler SW, Khailany B, Krashinsky R, Dally WJ (2012) Unifying primary cache, scratch, and register file memories in a throughput processor. In: MICRO-45 proceedings of the 2012 45th annual IEEE/ACM international symposium on microarchitecture, pp 96–106
Lindholm E et al. (2008) NVIDIA tesla: a unified graphics and computing architecture. IEEE Micro 28(2):39–55
NVIDIA Crop. CUDA C programming guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/
Falahati H, Abdi M, Baniasadi A, Hessabi S (2013) ISP: using idle SMs in hardware-based prefetching. In: 17th CSI international symposium on computer architecture and digital systems (CADS), 2013, Tehran, 30–31 October 2013, pp 3–8
NVIDIA’s next generation CUDA compute architecture: Fermi. http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
AMS’s Radeon. http://developer.amd.com/resources/documentation-articles/gpu-demos/radeon-hd-6900-series-graphics-real-time-demo/
NVIDIAs. http://developer.nvidia.com/nvidia-gpu-computing-documentation
AMD. Chu MM (2010) GPU Computing: past, present and future with ATI stream technology.
Hennessey J, Patterson D (2006) Computer architecture: a quantitative approach, 4th edn. Morgan Kaufmann. http://www.amazon.com/Computer-Architecture-Quantitative-Approach-Edition/dp/0123704901
Fung WL et al. (2007) Dynamic warp formation and scheduling for efficient GPU control flow. In: 40th annual IEEE/ACM international symposium on microarchitecture, 2007 (MICRO 2007), Chicago, IL, 1–5 December 2007, pp 407–420
Gebhart M, Johnson DR, Tarjan D, Keckler SW, Dally WJ, Lindholm E, Skadron K (2011) Energy-efficient mechanisms for managing thread context in throughput processors. In: Proceedings of the 38th annual international symposium on computer architecture (ISCA ’11 ), pp 235–246
Gilani SZ, Kim NS, Schulte MJ (2013) Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. In: Proceedings of the 46th annual IEEE/ACM international symposium on microarchitecture (MICRO-46), pp 74–85
Abdel-Majeed M, Wong D, Annavaram M (2013) Warped gates: gating aware scheduling and power gating for GPGPUs. In: Proceedings of the 46th annual IEEE/ACM international symposium on microarchitecture (MICRO-46), pp 111–122
Leng J, Hetherington T, Eitantawy A, Gilani S, Kim NS, Aamodt TM, Reddi VJ (2013) GPUWattch: enabling energy optimizations in GPGPUs. In: Proceedings of the 40th annual international symposium on computer architecture, pp 487–498
Lucas J, Lal S, Andersch M, Mesa MA, Juurlink B (2013) How a single chip causes massive power bills GPUSimPow: a GPGPU power simulator. In: Proceedings of ISPASS, 2013
Li S et al. (2009) McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In: 42nd annual IEEE/ACM international symposium on microarchitecture, 2009 (MICRO-42), New York, NY, 12–16 December 2009, pp 469–480
Keramidas G, Spiliopoulos V, Kaxiras S (2010) Interval-based models for run-time DVFS orchestration in superscalar processors. In: Proceedings of the 7th ACM international conference on computing frontiers (CF ’10), pp 287–296
Eyerman S, Eeckhout L, Karkhanis T, Smith JE (2010) A mechanistic performance model for superscalar out-of-order processors. In: ACM Trans Comput Syst 27(2). doi:10.1145/1534909.1534910
Aamodt TM et al. (2012) GPGPU-Sim 3.x Manual. University of BritishColumbi. http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
Che S et al. (2009) Rodinia: a benchmark suite for heterogeneous computing. In: IEEE international symposium on workload characterization, 2009 (IISWC 2009), Austin, TX, 4–6 October 2009, pp 44–54
NVIDIA Corp. CUDA SDK 2.3. https://developer.nvidia.com/cuda-toolkit-23-downloads
NVIDIA Corp. CUDA SDK 3.1. https://developer.nvidia.com/cuda-toolkit-31-downloads
Rofouei M, Stathopoulos T, Ryffel S, Kaiser W, Sarrafzadeh M (2008) Energy-aware high performance computing with graphic processing units. In: Proceedings of the 2008 conference on power aware computing and systems (HotPower’08), pp 11–11
Huang S, Xiao S, Feng W (2009) On the energy efficiency of graphics processing units for scientific computing. In: IEEE international symposium on parallel & distributed processing, 2009 (IPDPS 2009), Rome, 23–29 May 2009, pp 1–8
Jiao Y, Lin H, Balaji P, Feng W (2010) Power and performance characterization of computational kernels on the GPU. In: IEEE/ACM international conference on green computing and communications, 2010 (GreenCom’10) & international conference on cyber, physical and social computing (CPSCom), Hangzhou, 18–20 December 2010, pp 221–228
Byna S, Chen Y, Sun XH (2009) Taxonomy of data prefetching for multicore processors. J Comput Sci Technol 24(3): 405–417. (Taxonomy of data prefetching for multicore processors).
Woo DH, Lee HS (2010) COMPASS: a programmable data prefetcher using idle GPU shaders. In: Proceedings of the fifteenth edition of ASPLOS on architectural support for programming languages and operating systems (ASPLOS XV), pp 297–310

Download references

Author information

Authors and Affiliations

Sharif University of Technology, Tehran, Iran
Hajar Falahati, Shaahin Hessabi & Mania Abdi
University of Victoria, Victoria, Canada
Amirali Baniasadi

Authors

Hajar Falahati
View author publications
You can also search for this author in PubMed Google Scholar
Shaahin Hessabi
View author publications
You can also search for this author in PubMed Google Scholar
Mania Abdi
View author publications
You can also search for this author in PubMed Google Scholar
Amirali Baniasadi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hajar Falahati.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Falahati, H., Hessabi, S., Abdi, M. et al. Power-efficient prefetching on GPGPUs. J Supercomput 71, 2808–2829 (2015). https://doi.org/10.1007/s11227-014-1331-6

Download citation

Published: 05 December 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s11227-014-1331-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Power-efficient prefetching on GPGPUs

Abstract

Access this article

Similar content being viewed by others

GPU Architecture

Improving Performance and Energy Efficiency of Geophysics Applications on GPU Architectures

Stream data prefetcher for the GPU memory interface

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Power-efficient prefetching on GPGPUs

Abstract

Access this article

Similar content being viewed by others

GPU Architecture

Improving Performance and Energy Efficiency of Geophysics Applications on GPU Architectures

Stream data prefetcher for the GPU memory interface

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation