ABSTRACT
Value similarity of operands across warps has been exploited to improve the energy efficiency of GPUs. Prior work, however, incurs significant overhead by checking value similarity for every instruction, and it does not improve performance because it does not reduce the number of executed instructions. This work proposes Lock 'n Load (LnL), which triggers approximate execution of code regions by checking the similarity of only the values returned from load instructions, and fuses multiple approximated warps into a single warp.
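The mechanism described in the abstract can be illustrated with a minimal Python sketch. The function names, the relative-difference similarity metric, and the tolerance value below are assumptions for illustration; they are not the paper's exact hardware mechanism, which checks similarity only on load results and fuses warps whose lanes agree.

```python
# Illustrative sketch of the LnL idea: similarity is checked only on
# values returned by load instructions, and warps whose loaded values
# are mutually similar are "fused" so a single representative warp
# executes on their behalf. The tolerance and metric are assumptions.

def values_similar(values, tol=0.05):
    """True if all loaded values lie within a relative tolerance of their mean."""
    mean = sum(values) / len(values)
    if mean == 0:
        return all(v == 0 for v in values)
    return all(abs(v - mean) <= tol * abs(mean) for v in values)

def fuse_warps(warps, tol=0.05):
    """Group warps whose per-lane loaded values are pairwise similar.

    Each warp is a list of lane values returned by a load. Similar
    warps are fused behind one representative warp, reducing the
    number of warps that actually execute the approximated region.
    """
    fused = []  # list of (representative_warp, member_count)
    for warp in warps:
        for i, (rep, count) in enumerate(fused):
            # Lane-wise similarity check against the representative warp.
            if all(values_similar([a, b], tol) for a, b in zip(rep, warp)):
                fused[i] = (rep, count + 1)
                break
        else:
            fused.append((warp, 1))
    return fused
```

Under this sketch, two warps that load near-identical data collapse into one executed warp, which is the source of LnL's performance gain over schemes that only gate redundant lanes.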
Index Terms: Load-Triggered Warp Approximation on GPU