
Improving GPU performance via large warps and two-level warp scheduling

Published: 3 December 2011

ABSTRACT

Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways. First, threads are grouped into fixed-size SIMD batches known as warps, and second, many such warps are concurrently executed on a single GPU core. Despite these techniques, the computational resources on GPU cores are still underutilized, resulting in performance far short of what could be delivered. Two major causes are branch divergence due to conditional branch instructions and stalls due to long-latency operations.
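To make the branch-divergence problem concrete, the sketch below (a simplified two-pass model, not the paper's simulator) estimates the fraction of active SIMD lanes when threads of a 32-wide warp split across the two paths of a conditional branch; both paths must then execute serially, each with a subset of lanes masked off:

```python
WARP_SIZE = 32

def divergence_utilization(taken_mask):
    """Average fraction of active SIMD lanes for a warp executing a
    conditional branch, assuming a simple two-pass serialization model."""
    taken = sum(taken_mask)          # lanes that take the branch
    not_taken = WARP_SIZE - taken    # lanes that fall through
    if taken == 0 or not_taken == 0:
        return 1.0                   # no divergence: all lanes stay active
    # Divergent case: the hardware issues each path with only its own
    # subset of lanes enabled, so two passes cover WARP_SIZE lanes total.
    return (taken + not_taken) / (2 * WARP_SIZE)

# Example: odd-numbered threads take the branch -> two half-empty passes.
mask = [i % 2 == 1 for i in range(WARP_SIZE)]
print(divergence_utilization(mask))  # 0.5
```

In this model any divergent split halves utilization, since the warp spends two issue slots to cover one warp's worth of lanes; the paper's large warp microarchitecture targets exactly this kind of waste.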

To improve GPU performance, computational resources must be more effectively utilized. To accomplish this, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
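At a high level, two-level warp scheduling partitions a core's warps into smaller fetch groups and prioritizes warps in the active group, falling through to the next group only when every warp in the current one is stalled; this staggers long-latency operations so that some warps are always ready to issue. The following Python sketch illustrates that selection policy (the group size, warp representation, and field names are illustrative assumptions, not the paper's actual parameters):

```python
def pick_warp(warps, group_size):
    """Two-level selection sketch: scan fetch groups in order and issue
    the first ready warp from the first group that has one."""
    # Level 1: partition the warps into fixed-size fetch groups.
    groups = [warps[i:i + group_size] for i in range(0, len(warps), group_size)]
    # Level 2: prefer the earliest group; move on only if it is fully stalled.
    for group in groups:
        for warp in group:
            if warp["ready"]:
                return warp["id"]
    return None  # every warp is stalled on a long-latency operation

# Example: all of fetch group 0 (warps 0-3) is stalled on memory,
# so the scheduler switches to group 1 and issues warp 4.
warps = [{"id": i, "ready": i >= 4} for i in range(8)]
print(pick_warp(warps, group_size=4))  # 4
```

A real implementation would also rotate priority within and across groups (round-robin) rather than always scanning from warp 0; the sketch only shows the group-switching behavior that hides long-latency stalls.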


Published in

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
December 2011, 519 pages
ISBN: 9781450310536
DOI: 10.1145/2155620
Conference Chair: Carlo Galuzzi
General Chair: Luigi Carro
Program Chairs: Andreas Moshovos, Milos Prvulovic

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 484 of 2,242 submissions (22%)
