Research Article · DOI: 10.1145/2751205.2751234

A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs

Published: 08 June 2015

Abstract

General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which together administer the thread-level parallelism (TLP). Previous research shows that maximized TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, so they cannot adapt to the diverse access patterns of different applications. Dynamic TLP optimization in the warp scheduler thus remains a challenge for fully exploiting the highly parallel compute power of GPGPUs.
In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which adjusts the TLP according to pipeline stalls. SAWS adds two modules to the original scheduler to tune the TLP dynamically at runtime, employing a trigger-based method for a fast tuning response. We implemented SAWS in GPGPU-Sim and conducted extensive experiments on 21 representative benchmarks. Our numerical results show that SAWS effectively improves pipeline efficiency by reducing structural hazards without introducing extra data hazards. SAWS achieves a geometric-mean speedup of 14.7% over a wide range of benchmarks, higher even than the existing Two-Level scheduling scheme with its optimal fetch group sizes. More importantly, compared with dynamic TLP optimization in the CTA scheduler, SAWS still delivers a 9.3% performance improvement across the benchmarks, showing that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice.
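The abstract describes SAWS only at a high level. As a loose illustration of what trigger-based, stall-aware TLP tuning can look like, the C++ sketch below throttles the number of schedulable warps when structural-hazard stalls accumulate within a sampling window, and relaxes the limit when the pipeline runs cleanly. All names and thresholds (StallAwareScheduler, kStallTrigger, and so on) are hypothetical and are not taken from the paper's actual hardware design.

```cpp
#include <cstdint>

// Hypothetical sketch of trigger-based TLP tuning in a warp scheduler.
// The class tracks structural-hazard stalls over a fixed window of
// cycles and adjusts an active-warp limit that the issue stage obeys.
class StallAwareScheduler {
public:
    explicit StallAwareScheduler(uint32_t max_warps)
        : max_warps_(max_warps), active_limit_(max_warps) {}

    // Called once per scheduler cycle with the stall signal observed
    // in the pipeline during that cycle.
    void OnCycle(bool structural_stall) {
        if (structural_stall) ++stall_count_;
        if (++cycle_count_ < kWindow) return;

        // Trigger: many structural stalls in the window suggest that
        // the current TLP oversubscribes pipeline resources, so shrink
        // the limit; a nearly clean window lets it grow back.
        if (stall_count_ > kStallTrigger && active_limit_ > kMinWarps)
            --active_limit_;
        else if (stall_count_ < kIdleTrigger && active_limit_ < max_warps_)
            ++active_limit_;

        cycle_count_ = 0;
        stall_count_ = 0;
    }

    // Warps with an index below this limit may be issued; the rest
    // stay pending, reducing contention without descheduling CTAs.
    uint32_t ActiveLimit() const { return active_limit_; }

private:
    static constexpr uint32_t kWindow = 1024;       // sampling window (cycles)
    static constexpr uint32_t kStallTrigger = 256;  // shrink threshold
    static constexpr uint32_t kIdleTrigger = 32;    // grow threshold
    static constexpr uint32_t kMinWarps = 2;        // never starve the core

    const uint32_t max_warps_;
    uint32_t active_limit_;
    uint32_t cycle_count_ = 0;
    uint32_t stall_count_ = 0;
};
```

In a simulator such as GPGPU-Sim, a limit like this would simply gate which warps the issue logic may consider each cycle; the paper's two added modules, trigger conditions, and thresholds differ from this sketch.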



    Published In

    ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
    June 2015
    446 pages
    ISBN:9781450335591
    DOI:10.1145/2751205
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. gpgpu
    2. pipeline stall
    3. thread level parallelism
    4. two-level scheduling
    5. warp scheduler

    Qualifiers

    • Research-article

    Conference

ICS '15: 2015 International Conference on Supercomputing
June 8-11, 2015
Newport Beach, California, USA

    Acceptance Rates

ICS '15 Paper Acceptance Rate: 40 of 160 submissions, 25%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%


    Cited By

• (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024.
• (2024) Many-BSP: an analytical performance model for CUDA kernels. Computing, 106(5), pp. 1519-1555. DOI: 10.1007/s00607-023-01255-w. Online publication date: 1-May-2024.
• (2023) R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. DOI: 10.1145/3579371.3589039. Online publication date: 17-Jun-2023.
• (2020) Exploring Warp Criticality in Near-Threshold GPGPU Applications Using a Dynamic Choke Point Analysis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(2), pp. 456-466. DOI: 10.1109/TVLSI.2019.2943450. Online publication date: Feb-2020.
• (2020) Approximate NoC and Memory Controller Architectures for GPGPU Accelerators. IEEE Transactions on Parallel and Distributed Systems, 31(5), pp. 25-39. DOI: 10.1109/TPDS.2019.2958344. Online publication date: 1-May-2020.
• (2019) HAWS. ACM Transactions on Architecture and Code Optimization, 16(2), pp. 1-22. DOI: 10.1145/3291050. Online publication date: 18-Apr-2019.
• (2019) Improving GPGPU Performance Using Efficient Scheduling. 2019 International Conference on Intelligent Sustainable Systems (ICISS), pp. 570-577. DOI: 10.1109/ISS1.2019.8908051. Online publication date: Feb-2019.
• (2019) A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012. Online publication date: Jan-2019.
• (2018) DAPPER. Proceedings of the Twelfth IEEE/ACM International Symposium on Networks-on-Chip, pp. 1-8. DOI: 10.5555/3306619.3306626. Online publication date: 4-Oct-2018.
• (2018) Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory. ACM Transactions on Architecture and Code Optimization, 15(4), pp. 1-24. DOI: 10.1145/3280849. Online publication date: 16-Nov-2018.
