Optimizing GPU programs by register demotion: poster

Authors:
Putt Sakdhnagool, National Electronics and Computer Technology Center, Pathum Thani, Thailand
Amit Sabne, Google Brain
Rudolf Eigenmann, University of Delaware

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, February 2019, Pages 405–406
https://doi.org/10.1145/3293883.3297859

Published: 16 February 2019

ABSTRACT

GPU utilization, measured as occupancy, is limited by the combined on-chip resource usage of the parallel threads. If the resource demand cannot be met, the GPU reduces the number of concurrent threads, which hurts program performance. We have observed that registers are the occupancy limiters, while shared memory tends to be underused. The de facto approach spills excess registers to off-chip memory, ignoring shared memory and leaving on-chip resources underutilized. To reduce register demand, our work presents a novel compiler technique, called register demotion, that places register data into the underutilized shared memory by transforming the GPU assembly code (SASS). Register demotion achieves up to 18% speedup over the nvcc compiler, with a geometric mean of 7%.
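
The occupancy argument in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' tool or method: the per-SM limits, the kernel figures (256 threads per block, 40 registers per thread), the assumption that each demoted register costs 4 bytes of shared memory per thread, and the function names are all hypothetical and chosen only for illustration. It also ignores the allocation granularity that real hardware applies to registers and shared memory.

```python
# Illustrative per-SM limits (assumptions, not taken from the poster).
REGS_PER_SM = 65536          # 32-bit registers available per SM
SMEM_PER_SM = 48 * 1024      # bytes of shared memory per SM
MAX_THREADS_PER_SM = 2048    # resident-thread limit per SM


def resident_blocks(regs_per_thread, smem_per_block, threads_per_block):
    """Number of thread blocks that fit on one SM under the limits above."""
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    by_smem = SMEM_PER_SM // smem_per_block if smem_per_block > 0 else by_threads
    return min(by_threads, by_regs, by_smem)


def occupancy(regs_per_thread, smem_per_block, threads_per_block):
    """Resident threads divided by the maximum resident threads per SM."""
    blocks = resident_blocks(regs_per_thread, smem_per_block, threads_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM


THREADS = 256   # hypothetical block size
DEMOTED = 4     # hypothetical number of registers moved to shared memory

# Before: 40 registers per thread, no shared memory used.
before = occupancy(40, 0, THREADS)
# After: 36 registers per thread; each demoted register is assumed to
# cost 4 bytes of shared memory per thread.
after = occupancy(40 - DEMOTED, DEMOTED * 4 * THREADS, THREADS)

print(before, after)  # prints: 0.75 0.875
```

Under these assumed numbers, registers allow only 6 resident blocks before demotion and 7 after it. Demoting 8 registers instead would push shared memory usage to 8 KiB per block and make shared memory the new limiter at 6 blocks, so in this simplified model the gain depends on how much is moved.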

Index Terms

Software and its engineering → Software notations and tools → Compilers → Source code generation

Published in
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN: 9781450362252
DOI: 10.1145/3293883
General Chair: Jeff Hollingsworth, University of Maryland
Program Chair: Idit Keidar, Technion, Israel
Copyright © 2019 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
assembler
compiler
register spilling
Qualifiers
- poster
Acceptance Rates
PPoPP '19 Paper Acceptance Rate: 29 of 152 submissions, 19%
Overall Acceptance Rate: 230 of 1,014 submissions, 23%