
Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency

Published: 07 December 2013

Abstract

Modern GPUs share limited hardware resources, such as register files, among a large number of concurrently executing threads. For efficient resource sharing, several buffering and collision avoidance stages are inserted in the GPU pipeline. These additional stages increase the read-after-write (RAW) latencies of instructions. Since GPUs are often architected to hide RAW latencies through extensive multithreading, they typically do not employ power-hungry data-forwarding networks (DFNs). However, we observe that many GPGPU applications do not have enough active threads that are ready to issue instructions to hide these RAW latencies. In this paper, we first demonstrate that DFNs can considerably improve the performance of many compute-intensive GPGPU applications and then propose most recent result forwarding (MoRF) as a low-power alternative to the DFN. Second, for floating-point (FP) operations, we exploit a high-throughput fused multiply-add (HFMA) unit to further reduce both RAW latencies and the number of FMA units in the GPU without impacting instruction throughput. MoRF and HFMA together provide a geometric mean performance improvement of 18% and 29% for integer/single-precision and double-precision GPGPU applications, respectively. Finally, both MoRF and HFMA allow the GPU to effectively mimic a shallower pipeline for a large percentage of instructions. Exploiting such a benefit, we propose low-power pipelines that can reduce peak power consumption by 14% without affecting the performance or increasing the complexity of the forwarding network. The peak power reduction allows GPUs to operate more cores within the same power budget, achieving a geometric mean performance improvement of 33% for double-precision GPGPU applications.
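The latency-hiding argument above can be illustrated with a toy throughput model. This sketch is not from the paper; the function name and all parameters are illustrative. It shows why a kernel with fewer ready warps than the pipeline's effective RAW latency leaves issue slots idle, and why shortening the effective latency (as forwarding does for dependent instructions) recovers throughput.

```python
# Toy model (illustrative only): effective cycles per instruction on a
# simple in-order, single-issue pipeline where each warp must wait
# `raw_latency` cycles between dependent instructions. Other ready warps
# can issue in that gap; with too few warps, the issue slot sits idle.

def cycles_per_instruction(active_warps: int, raw_latency: int) -> float:
    """Return effective CPI given the number of ready warps and the
    read-after-write latency (in cycles) seen by dependent instructions."""
    if active_warps >= raw_latency:
        return 1.0  # enough warps: RAW latency fully hidden
    return raw_latency / active_warps  # issue stalls inflate CPI

# With a 24-cycle effective RAW latency and only 8 ready warps, CPI
# triples; cutting the effective latency to 8 cycles removes the stalls.
assert cycles_per_instruction(8, 24) == 3.0
assert cycles_per_instruction(8, 8) == 1.0
```

Under this simple model, a forwarding path that reduces the effective RAW latency helps exactly those applications that cannot sustain enough ready warps, which matches the abstract's observation about compute-intensive GPGPU workloads.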



    Published In

    MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2013
    498 pages
    ISBN:9781450326384
    DOI:10.1145/2540708

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPUs
    2. low-power
    3. pipeline latencies

    Qualifiers

    • Research-article

    Acceptance Rates

    MICRO-46 paper acceptance rate: 39 of 239 submissions (16%).
    Overall acceptance rate: 484 of 2,242 submissions (22%).


    Cited By

    • (2024) Cross-Core Data Sharing for Energy-Efficient GPUs. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3653019. Online publication date: 18-Mar-2024.
    • (2019) ITAP. ACM Transactions on Architecture and Code Optimization, 16(1), 1-26. DOI: 10.1145/3291606. Online publication date: 27-Feb-2019.
    • (2018) CRAT: Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs. IEEE Transactions on Computers, 67(6), 890-897. DOI: 10.1109/TC.2017.2776272. Online publication date: 1-Jun-2018.
    • (2018) WIR: Warp Instruction Reuse to Minimize Repeated Computations in GPUs. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 389-402. DOI: 10.1109/HPCA.2018.00041. Online publication date: Feb-2018.
    • (2017) An Energy-Efficient GPGPU Register File Architecture Using Racetrack Memory. IEEE Transactions on Computers, 66(9), 1478-1490. DOI: 10.1109/TC.2017.2690855. Online publication date: 1-Sep-2017.
    • (2016) μC-States. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 17-30. DOI: 10.1145/2967938.2967941. Online publication date: 11-Sep-2016.
    • (2015) Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. Proceedings of the 48th International Symposium on Microarchitecture, 395-406. DOI: 10.1145/2830772.2830813. Online publication date: 5-Dec-2015.
    • (2015) Power-efficient prefetching on GPGPUs. The Journal of Supercomputing, 71(8), 2808-2829. DOI: 10.1007/s11227-014-1331-6. Online publication date: 1-Aug-2015.
    • (2014) A Survey of Methods for Analyzing and Improving GPU Energy Efficiency. ACM Computing Surveys, 47(2), 1-23. DOI: 10.1145/2636342. Online publication date: 25-Aug-2014.
