ABSTRACT
Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general-purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources on GPU cores are still underutilized, resulting in performance far short of what could be delivered. Two major causes are conditional branch instructions, which force a warp whose threads diverge to execute each taken path serially with some SIMD lanes disabled, and stalls due to long-latency operations.
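As a concrete illustration of the first problem, here is a minimal sketch of a divergent CUDA kernel (the kernel name, data, and launch configuration are hypothetical, not taken from the paper). On NVIDIA hardware a warp is 32 threads executing in lockstep, so a data-dependent branch that splits a warp forces the hardware to run both paths back to back, masking off the inactive lanes each time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a data-dependent branch. Threads of the same
// 32-thread warp evaluate the condition differently, so the warp executes
// both paths one after the other with the inactive lanes masked off.
__global__ void divergent_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;   // odd-element lanes sit idle on this path
    else
        out[i] = in[i] + 1;   // even-element lanes sit idle on this path
}

int main()
{
    const int n = 1 << 20;
    int *in, *out;
    cudaMallocManaged(&in,  n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;

    divergent_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[3] = %d, out[4] = %d\n", out[3], out[4]);  // 4 and 8
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

In the worst case, a branch with a different outcome in every lane serializes the warp down to one active lane at a time.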
To improve GPU performance, computational resources must be more effectively utilized. To accomplish this, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
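The following host-side sketch models the large warp idea under stated assumptions (the 4x8 large-warp geometry and the activity mask are invented for illustration; the paper's actual warp and sub-warp sizes may differ). A large warp is held as a two-dimensional mask of threads, each row as wide as the SIMD pipeline; after a divergent branch, sub-warps are formed dynamically by packing at most one active thread per lane column, compacting sparse rows into dense SIMD batches:

```cuda
#include <cstdio>

// Hypothetical geometry: one large warp of 4 rows x 8 SIMD lanes.
constexpr int kRows = 4, kLanes = 8;

int main()
{
    // Activity mask after a divergent branch (1 = thread on this path).
    int mask[kRows][kLanes] = {
        {1, 0, 1, 1, 0, 1, 0, 1},
        {0, 1, 1, 0, 1, 0, 1, 0},
        {1, 1, 0, 1, 0, 1, 1, 1},
        {0, 0, 1, 0, 1, 1, 0, 1},
    };

    auto active_left = [&] {
        for (int r = 0; r < kRows; ++r)
            for (int l = 0; l < kLanes; ++l)
                if (mask[r][l]) return true;
        return false;
    };

    // Each cycle, form one sub-warp by taking at most one active thread
    // from every lane column; the 19 active threads above fit in 3 dense
    // sub-warps instead of 4 sparse row-by-row issues.
    for (int sub = 0; active_left(); ++sub) {
        printf("sub-warp %d rows:", sub);
        for (int lane = 0; lane < kLanes; ++lane) {
            int picked = -1;
            for (int r = 0; r < kRows; ++r)
                if (mask[r][lane]) { picked = r; mask[r][lane] = 0; break; }
            printf(" %2d", picked);   // row feeding this lane, -1 = idle lane
        }
        printf("\n");
    }
    return 0;
}
```

Similarly, here is a toy trace of two-level warp scheduling (warp count, group size, and latency model all hypothetical): warps are split into fetch groups, the scheduler round-robins within the active group, and it moves to the next group only when every warp in the current one is blocked on a long-latency operation, so the groups reach their memory stalls at different times and hide each other's latency:

```cuda
#include <cstdio>
#include <vector>

// Hypothetical configuration: 8 warps split into 2 fetch groups of 4.
// Toy timing model: every third instruction a warp issues is a load
// that stalls it for 8 cycles, standing in for DRAM latency.
struct Warp { int stall = 0, inst = 0; };

int main()
{
    const int kWarps = 8, kGroup = 4, kGroups = kWarps / kGroup;
    std::vector<Warp> warps(kWarps);
    int group = 0, rr = 0;

    for (int cycle = 0; cycle < 24; ++cycle) {
        for (auto &w : warps)
            if (w.stall > 0) --w.stall;

        // Level 1: round-robin over the warps of the active fetch group.
        int issued = -1;
        for (int k = 0; k < kGroup; ++k) {
            int idx = group * kGroup + (rr + k) % kGroup;
            if (warps[idx].stall == 0) {
                issued = idx;
                rr = (rr + k + 1) % kGroup;
                if (++warps[idx].inst % 3 == 0) warps[idx].stall = 8;
                break;
            }
        }

        // Level 2: the whole group is stalled -> activate the next group,
        // which has not yet reached its loads and can keep the core busy.
        if (issued < 0) {
            group = (group + 1) % kGroups;
            rr = 0;
            printf("cycle %2d: group stalled, switching to group %d\n",
                   cycle, group);
        } else {
            printf("cycle %2d: issue warp %d (group %d)\n",
                   cycle, issued, group);
        }
    }
    return 0;
}
```

By contrast, pure round-robin over all eight warps would let every warp reach its load at roughly the same time, leaving no ready warps to hide the latency.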