
Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency

Published: 07 December 2013

Abstract

Modern GPUs share limited hardware resources, such as register files, among a large number of concurrently executing threads. For efficient resource sharing, several buffering and collision avoidance stages are inserted in the GPU pipeline. These additional stages increase the read-after-write (RAW) latencies of instructions. Since GPUs are often architected to hide RAW latencies through extensive multithreading, they typically do not employ power-hungry data-forwarding networks (DFNs). However, we observe that many GPGPU applications do not have enough active threads that are ready to issue instructions to hide these RAW latencies. In this paper, we first demonstrate that DFNs can considerably improve the performance of many compute-intensive GPGPU applications and then propose most recent result forwarding (MoRF) as a low-power alternative to the DFN. Second, for floating-point (FP) operations, we exploit a high-throughput fused multiply-add (HFMA) unit to further reduce both RAW latencies and the number of FMA units in the GPU without impacting instruction throughput. MoRF and HFMA together provide a geometric mean performance improvement of 18% and 29% for integer/single-precision and double-precision GPGPU applications, respectively. Finally, both MoRF and HFMA allow the GPU to effectively mimic a shallower pipeline for a large percentage of instructions. Exploiting such a benefit, we propose low-power pipelines that can reduce peak power consumption by 14% without affecting the performance or increasing the complexity of the forwarding network. The peak power reduction allows GPUs to operate more cores within the same power budget, achieving a geometric mean performance improvement of 33% for double-precision GPGPU applications.
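The latency-hiding argument above can be illustrated with a toy throughput model. This sketch is not from the paper; the function name and all parameters are illustrative. It shows why a kernel with fewer ready warps than the pipeline's effective RAW latency leaves issue slots idle, and why shortening the effective latency (as forwarding does for dependent instructions) recovers throughput.

```python
# Toy model (illustrative only): effective cycles per instruction on a
# simple in-order, single-issue pipeline where each warp must wait
# `raw_latency` cycles between dependent instructions. Other ready warps
# can issue in that gap; with too few warps, the issue slot sits idle.

def cycles_per_instruction(active_warps: int, raw_latency: int) -> float:
    """Return effective CPI given the number of ready warps and the
    read-after-write latency (in cycles) seen by dependent instructions."""
    if active_warps >= raw_latency:
        return 1.0  # enough warps: RAW latency fully hidden
    return raw_latency / active_warps  # issue stalls inflate CPI

# With a 24-cycle effective RAW latency and only 8 ready warps, CPI
# triples; cutting the effective latency to 8 cycles removes the stalls.
assert cycles_per_instruction(8, 24) == 3.0
assert cycles_per_instruction(8, 8) == 1.0
```

Under this simple model, a forwarding path that reduces the effective RAW latency helps exactly those applications that cannot sustain enough ready warps, which matches the abstract's observation about compute-intensive GPGPU workloads.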



    Published In

    MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2013
    498 pages
    ISBN:9781450326384
    DOI:10.1145/2540708

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPUs
    2. low-power
    3. pipeline latencies

    Qualifiers

    • Research-article

    Acceptance Rates

    MICRO-46 paper acceptance rate: 39 of 239 submissions (16%).
    Overall acceptance rate: 484 of 2,242 submissions (22%).


    Cited By

    • (2024) Cross-Core Data Sharing for Energy-Efficient GPUs. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3653019. Online publication date: 18-Mar-2024.
    • (2019) ITAP. ACM Transactions on Architecture and Code Optimization, 16(1), 1-26. DOI: 10.1145/3291606. Online publication date: 27-Feb-2019.
    • (2018) CRAT: Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs. IEEE Transactions on Computers, 67(6), 890-897. DOI: 10.1109/TC.2017.2776272. Online publication date: 1-Jun-2018.
    • (2018) WIR: Warp Instruction Reuse to Minimize Repeated Computations in GPUs. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 389-402. DOI: 10.1109/HPCA.2018.00041. Online publication date: Feb-2018.
    • (2017) An Energy-Efficient GPGPU Register File Architecture Using Racetrack Memory. IEEE Transactions on Computers, 66(9), 1478-1490. DOI: 10.1109/TC.2017.2690855. Online publication date: 1-Sep-2017.
    • (2016) μC-States. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 17-30. DOI: 10.1145/2967938.2967941. Online publication date: 11-Sep-2016.
    • (2015) Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. Proceedings of the 48th International Symposium on Microarchitecture, 395-406. DOI: 10.1145/2830772.2830813. Online publication date: 5-Dec-2015.
    • (2015) Power-efficient prefetching on GPGPUs. The Journal of Supercomputing, 71(8), 2808-2829. DOI: 10.1007/s11227-014-1331-6. Online publication date: 1-Aug-2015.
    • (2014) A Survey of Methods for Analyzing and Improving GPU Energy Efficiency. ACM Computing Surveys, 47(2), 1-23. DOI: 10.1145/2636342. Online publication date: 25-Aug-2014.
