DOI: 10.1145/2751205.2751234

A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs

Published: 08 June 2015


General-purpose graphics processing units (GPGPUs) are widely used as application accelerators in high-performance computing because of their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which together manage thread-level parallelism (TLP). Previous research shows that maximizing TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, so they cannot adapt to the varied access patterns of diverse applications. Dynamic TLP optimization in the warp scheduler therefore remains a challenge in exploiting the highly parallel compute power of GPGPUs.
In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which optimizes TLP according to pipeline stalls. SAWS adds two modules to the original scheduler to adjust TLP dynamically at runtime, and employs a trigger-based method for a fast tuning response. We simulated SAWS and conducted extensive experiments on GPGPU-Sim using 21 representative benchmarks. Our results show that SAWS effectively improves pipeline efficiency by reducing structural hazards without causing extra data hazards. SAWS achieves a geometric-mean speedup of 14.7% over a wide range of benchmarks, higher even than the existing Two-Level scheduling scheme with optimal fetch group sizes. More importantly, compared with dynamic TLP optimization in CTA scheduling, SAWS still delivers a 9.3% performance improvement across the benchmarks, which shows that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice.
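
For readers unfamiliar with how an average speedup is reported as a geometric mean, the following is a minimal sketch of the calculation. The function name `geometric_mean_speedup` and the cycle counts are illustrative assumptions only; they are not taken from the paper or from GPGPU-Sim output.

```python
import math


def geometric_mean_speedup(baseline_cycles, new_cycles):
    """Return the geometric mean of per-benchmark speedups (baseline / new)."""
    if not baseline_cycles or len(baseline_cycles) != len(new_cycles):
        raise ValueError("need two non-empty lists of equal length")
    # Average the logarithms of the ratios, then exponentiate; this equals
    # the n-th root of the product of the ratios and avoids overflow.
    log_sum = sum(math.log(b / n) for b, n in zip(baseline_cycles, new_cycles))
    return math.exp(log_sum / len(baseline_cycles))


# Hypothetical cycle counts for three made-up benchmarks (NOT from the paper).
baseline = [1000.0, 2000.0, 1500.0]
with_new_scheduler = [800.0, 2000.0, 1200.0]

gm = geometric_mean_speedup(baseline, with_new_scheduler)
print(f"geometric-mean speedup: {(gm - 1.0) * 100.0:.1f}%")
```

The geometric mean is commonly preferred for ratios such as speedups because a single outlier benchmark influences it less than it would an arithmetic mean.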


    Published In

    ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
    June 2015
    446 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]



    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. gpgpu
    2. pipeline stall
    3. thread level parallelism
    4. two-level scheduling
    5. warp scheduler


    • Research-article

    ICS'15: 2015 International Conference on Supercomputing
    June 8 - 11, 2015
    Newport Beach, California, USA

    Acceptance Rates

    ICS '15 Paper Acceptance Rate 40 of 160 submissions, 25%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%


    Bibliometrics & Citations


    Article Metrics

    • Downloads (Last 12 months): 19
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 10 Feb 2025


    Cited By

    • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024
    • (2024) Many-BSP: an analytical performance model for CUDA kernels. Computing, 106:5, pages 1519-1555. DOI: 10.1007/s00607-023-01255-w. Online publication date: 1-May-2024
    • (2023) R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1-14. DOI: 10.1145/3579371.3589039. Online publication date: 17-Jun-2023
    • (2020) Exploring Warp Criticality in Near-Threshold GPGPU Applications Using a Dynamic Choke Point Analysis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28:2, pages 456-466. DOI: 10.1109/TVLSI.2019.2943450. Online publication date: Feb-2020
    • (2020) Approximate NoC and Memory Controller Architectures for GPGPU Accelerators. IEEE Transactions on Parallel and Distributed Systems, 31:5, pages 25-39. DOI: 10.1109/TPDS.2019.2958344. Online publication date: 1-May-2020
    • (2019) HAWS. ACM Transactions on Architecture and Code Optimization, 16:2, pages 1-22. DOI: 10.1145/3291050. Online publication date: 18-Apr-2019
    • (2019) Improving GPGPU Performance Using Efficient Scheduling. 2019 International Conference on Intelligent Sustainable Systems (ICISS), pages 570-577. DOI: 10.1109/ISS1.2019.8908051. Online publication date: Feb-2019
    • (2019) A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012. Online publication date: Jan-2019
    • (2018) DAPPER. Proceedings of the Twelfth IEEE/ACM International Symposium on Networks-on-Chip, pages 1-8. DOI: 10.5555/3306619.3306626. Online publication date: 4-Oct-2018
    • (2018) Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory. ACM Transactions on Architecture and Code Optimization, 15:4, pages 1-24. DOI: 10.1145/3280849. Online publication date: 16-Nov-2018
