ABSTRACT
Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general-purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources on GPU cores are still underutilized, resulting in performance far short of what could be delivered. Two major causes are conditional branch instructions, which force a warp whose threads diverge to execute each taken path serially with some SIMD lanes disabled, and stalls due to long-latency operations.
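As a concrete illustration of the first problem, here is a minimal sketch of a divergent CUDA kernel (the kernel name, data, and launch configuration are hypothetical, not taken from the paper). On NVIDIA hardware a warp is 32 threads executing in lockstep, so a data-dependent branch that splits a warp forces the hardware to run both paths back to back, masking off the inactive lanes each time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a data-dependent branch. Threads of the same
// 32-thread warp evaluate the condition differently, so the warp executes
// both paths one after the other with the inactive lanes masked off.
__global__ void divergent_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;   // odd-element lanes sit idle on this path
    else
        out[i] = in[i] + 1;   // even-element lanes sit idle on this path
}

int main()
{
    const int n = 1 << 20;
    int *in, *out;
    cudaMallocManaged(&in,  n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;

    divergent_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[3] = %d, out[4] = %d\n", out[3], out[4]);  // 4 and 8
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

In the worst case, a branch with a different outcome in every lane serializes the warp down to one active lane at a time.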
To improve GPU performance, computational resources must be more effectively utilized. To accomplish this, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
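The following host-side sketch models the large warp idea under stated assumptions (the 4x8 large-warp geometry and the activity mask are invented for illustration; the paper's actual warp and sub-warp sizes may differ). A large warp is held as a two-dimensional mask of threads, each row as wide as the SIMD pipeline; after a divergent branch, sub-warps are formed dynamically by packing at most one active thread per lane column, compacting sparse rows into dense SIMD batches:

```cuda
#include <cstdio>

// Hypothetical geometry: one large warp of 4 rows x 8 SIMD lanes.
constexpr int kRows = 4, kLanes = 8;

int main()
{
    // Activity mask after a divergent branch (1 = thread on this path).
    int mask[kRows][kLanes] = {
        {1, 0, 1, 1, 0, 1, 0, 1},
        {0, 1, 1, 0, 1, 0, 1, 0},
        {1, 1, 0, 1, 0, 1, 1, 1},
        {0, 0, 1, 0, 1, 1, 0, 1},
    };

    auto active_left = [&] {
        for (int r = 0; r < kRows; ++r)
            for (int l = 0; l < kLanes; ++l)
                if (mask[r][l]) return true;
        return false;
    };

    // Each cycle, form one sub-warp by taking at most one active thread
    // from every lane column; the 19 active threads above fit in 3 dense
    // sub-warps instead of 4 sparse row-by-row issues.
    for (int sub = 0; active_left(); ++sub) {
        printf("sub-warp %d rows:", sub);
        for (int lane = 0; lane < kLanes; ++lane) {
            int picked = -1;
            for (int r = 0; r < kRows; ++r)
                if (mask[r][lane]) { picked = r; mask[r][lane] = 0; break; }
            printf(" %2d", picked);   // row feeding this lane, -1 = idle lane
        }
        printf("\n");
    }
    return 0;
}
```

Similarly, here is a toy trace of two-level warp scheduling (warp count, group size, and latency model all hypothetical): warps are split into fetch groups, the scheduler round-robins within the active group, and it moves to the next group only when every warp in the current one is blocked on a long-latency operation, so the groups reach their memory stalls at different times and hide each other's latency:

```cuda
#include <cstdio>
#include <vector>

// Hypothetical configuration: 8 warps split into 2 fetch groups of 4.
// Toy timing model: every third instruction a warp issues is a load
// that stalls it for 8 cycles, standing in for DRAM latency.
struct Warp { int stall = 0, inst = 0; };

int main()
{
    const int kWarps = 8, kGroup = 4, kGroups = kWarps / kGroup;
    std::vector<Warp> warps(kWarps);
    int group = 0, rr = 0;

    for (int cycle = 0; cycle < 24; ++cycle) {
        for (auto &w : warps)
            if (w.stall > 0) --w.stall;

        // Level 1: round-robin over the warps of the active fetch group.
        int issued = -1;
        for (int k = 0; k < kGroup; ++k) {
            int idx = group * kGroup + (rr + k) % kGroup;
            if (warps[idx].stall == 0) {
                issued = idx;
                rr = (rr + k + 1) % kGroup;
                if (++warps[idx].inst % 3 == 0) warps[idx].stall = 8;
                break;
            }
        }

        // Level 2: the whole group is stalled -> activate the next group,
        // which has not yet reached its loads and can keep the core busy.
        if (issued < 0) {
            group = (group + 1) % kGroups;
            rr = 0;
            printf("cycle %2d: group stalled, switching to group %d\n",
                   cycle, group);
        } else {
            printf("cycle %2d: issue warp %d (group %d)\n",
                   cycle, issued, group);
        }
    }
    return 0;
}
```

By contrast, pure round-robin over all eight warps would let every warp reach its load at roughly the same time, leaving no ready warps to hide the latency.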