Architecture and Compiler Support for GPUs Using Energy-Efficient Affine Register Files

Published: 07 November 2017

Abstract

A modern GPU can process thousands of hardware threads simultaneously. These threads are grouped into fixed-size SIMD batches that execute the same instruction on vectors of data in lockstep to achieve high throughput. Because each SIMD group accesses a dedicated set of vector registers to allow fast context switching, the register files are very large, and their power consumption has become an important issue. One proposed solution is to replace some of the vector registers with scalar registers: different threads in the same SIMD group often operate on the same scalar value, so the redundant computations and accesses of these scalar values can be eliminated. However, it has been observed that a significant number of registers contain affine vectors υ such that υ[i] = b + i × s, which can be represented by a base b and a stride s. This article therefore proposes an affine register file design for GPUs that saves energy by reducing the redundant execution of both uniform and affine vectors. The design uses a pair of registers to store the base and stride of each affine vector and provides dedicated affine ALUs to execute affine instructions. A compiler analysis detects scalars and affine vectors and annotates instructions so that they are executed as the corresponding scalar and affine computations. In addition, a priority-based register allocation scheme assigns scalars and affine vectors to the appropriate scalar and affine register files. Experimental results show that the design dispatches 43.56% of the computations to scalar and affine ALUs when using eight scalar and four affine registers per warp. As a result, it reduces the energy consumption of the register files and ALUs to 21.86% and 26.54%, respectively, and reduces the overall energy consumption of the GPU by an average of 5.18%.
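
To make the base-and-stride representation in the abstract concrete, the following is a minimal Python sketch, not the article's hardware or ISA. The names `Affine`, `affine_add`, and `affine_mul_uniform`, the warp size of 32, and the address computation are all assumptions chosen for illustration, and integer overflow is ignored. It shows why an affine vector needs only two stored values and why adding two affine vectors, or scaling one by a uniform value, yields another affine vector that can be computed with two scalar operations instead of one operation per lane.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # assumption: 32 lanes per SIMD group


@dataclass(frozen=True)
class Affine:
    """A vector v with v[i] = base + i * stride, stored as two scalars."""
    base: int
    stride: int

    def expand(self, lanes=WARP_SIZE):
        """Materialize the per-lane values a full vector register would hold."""
        return [self.base + i * self.stride for i in range(lanes)]

    @property
    def is_uniform(self):
        """A uniform (scalar) vector is the special case stride == 0."""
        return self.stride == 0


def affine_add(x, y):
    # (b1 + i*s1) + (b2 + i*s2) = (b1 + b2) + i*(s1 + s2)
    return Affine(x.base + y.base, x.stride + y.stride)


def affine_mul_uniform(x, c):
    # c * (b + i*s) = c*b + i*(c*s); valid only if c is the same in every lane
    return Affine(x.base * c, x.stride * c)


# Hypothetical kernel fragment: addr = block_base + 4 * thread_id
thread_id = Affine(base=0, stride=1)      # lane index 0, 1, 2, ...
block_base = Affine(base=1024, stride=0)  # same value in every lane
addr = affine_add(block_base, affine_mul_uniform(thread_id, 4))
```

Here `addr` is held as the pair (1024, 4) rather than 32 separate lane values, and `addr.expand()` recovers the full vector only when a per-lane value is actually needed. The product of two non-uniform affine vectors is quadratic in the lane index, so in this sketch such an operation would have to fall back to an ordinary vector computation.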

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 23, Issue 2
March 2018
341 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3149546
  • Editor: Naehyuck Chang

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 November 2017
Accepted: 01 August 2017
Revised: 01 July 2017
Received: 01 March 2017
Published in TODAES Volume 23, Issue 2

Author Tags

  1. Energy efficient
  2. GPU
  3. register allocation
  4. register file organization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Ministry of Science and Technology of Taiwan
  • MediaTek Inc., Hsinchu, Taiwan
