ABSTRACT
GPU utilization, measured as occupancy, is limited by the combined on-chip resource usage of the parallel threads. If the resource demand cannot be met, the GPU reduces the number of concurrent threads, which hurts program performance. We have observed that registers are the occupancy limiters, while shared memory tends to be underused. The de facto approach spills excess registers to off-chip memory, ignoring shared memory and leaving on-chip resources underutilized. To mitigate the register demand, our work presents a novel compiler technique, called register demotion, that places register data into the underutilized shared memory by transforming the GPU assembly code (SASS). Register demotion achieves up to 18% speedup over the nvcc compiler, with a geometric mean of 7%.
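The occupancy argument can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' compiler pass. It only estimates how many thread blocks fit on one streaming multiprocessor (SM) under register, shared-memory, and thread limits. The per-SM limits (65536 registers, 48 KiB shared memory, 2048 threads, 32 blocks), the kernel's resource figures, and the function names `occupancy` and `demote` are all illustrative assumptions. The model also assumes each demoted 32-bit register costs 4 bytes of shared memory per thread, and it ignores hardware allocation granularity.

```python
# Simplified occupancy model. All per-SM limits below are assumed example
# values, not figures taken from the abstract.
REGS_PER_SM = 65536          # 32-bit registers per SM (assumption)
SMEM_PER_SM = 48 * 1024      # bytes of shared memory per SM (assumption)
MAX_THREADS_PER_SM = 2048    # resident threads per SM (assumption)
MAX_BLOCKS_PER_SM = 32       # resident blocks per SM (assumption)


def occupancy(regs_per_thread, smem_per_block, threads_per_block):
    """Fraction of the SM's thread slots that can be resident at once."""
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    by_smem = (SMEM_PER_SM // smem_per_block) if smem_per_block else MAX_BLOCKS_PER_SM
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks = min(by_regs, by_smem, by_threads, MAX_BLOCKS_PER_SM)
    return blocks * threads_per_block / MAX_THREADS_PER_SM


def demote(regs_per_thread, smem_per_block, threads_per_block, k):
    """Move k registers per thread into shared memory (4 bytes each, assumed)."""
    return regs_per_thread - k, smem_per_block + 4 * k * threads_per_block


# Hypothetical kernel: 72 registers/thread, 1 KiB shared memory/block, 128 threads/block.
before = occupancy(72, 1024, 128)                 # register-limited: 7 blocks
regs, smem = demote(72, 1024, 128, k=8)
after = occupancy(regs, smem, 128)                # 8 blocks fit after demotion
regs2, smem2 = demote(72, 1024, 128, k=24)
too_much = occupancy(regs2, smem2, 128)           # shared memory becomes the limiter

print(before, after, too_much)  # 0.4375 0.5 0.1875
```

In this toy configuration, demoting 8 registers per thread lifts occupancy from 0.4375 to 0.5, while demoting 24 makes shared memory the new bottleneck. This suggests why the number of demoted registers has to be balanced against the shared memory still available.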
Index Terms
- Optimizing GPU programs by register demotion: poster