DOI: 10.1145/2925426.2926267

Barrier-Aware Warp Scheduling for Throughput Processors

Published: 01 June 2016

Abstract

Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Little prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications, even though GPGPUs employ lightweight hardware-supported barriers. To investigate the causes, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that execution progress within a warp-phase varies dramatically across warps, a phenomenon we call warp-phase-divergence. While warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or from shared resource contention, we also pinpoint warp scheduling itself as a source of warp-phase-divergence.
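The cost of warp-phase-divergence described above can be made concrete with a toy model: within one warp-phase, every warp of a thread block that reaches the barrier early must stall until the slowest warp arrives. A minimal sketch (the arrival cycles below are invented for illustration, not measurements from the paper):

```python
# Toy model of warp-phase-divergence: warps of one thread block arrive
# at a barrier at different cycles; each stalls until the last arrives.
def barrier_stall_cycles(arrival_cycles):
    """Return per-warp stall cycles and the total stall at one barrier."""
    release = max(arrival_cycles)  # barrier opens when the slowest warp arrives
    stalls = [release - t for t in arrival_cycles]
    return stalls, sum(stalls)

# Hypothetical arrival cycles for 4 warps of one thread block.
stalls, total = barrier_stall_cycles([100, 120, 180, 400])
# The earliest warp stalls for 300 cycles; the slowest warp stalls for 0.
```

The greater the spread in arrival times, the more issue slots are wasted on stalled warps, which is why the paper targets the scheduler-induced component of this spread.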
To mitigate barrier-induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS), which combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp that MWF will issue in the next cycle. To evaluate BAWS, we consider 13 barrier-intensive GPGPU applications and report that BAWS speeds up performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We also compare BAWS against SAWS, a recent concurrent proposal, and find that BAWS outperforms SAWS by 7% on average (and up to 27%). For non-barrier-intensive workloads, BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS' hardware cost is limited to 6 bytes per streaming multiprocessor (SM).
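The abstract describes the MWF priority rule only at a high level. One plausible reading can be sketched as follows; the function name, the data layout, and the oldest-warp tie-break are assumptions for illustration, not the paper's hardware implementation:

```python
# Illustrative sketch of most-waiting-first (MWF) warp selection.
def pick_warp_mwf(ready_warps, warps_waiting_at_barrier):
    """Pick the next warp to issue.

    ready_warps: list of (warp_id, block_id, age) tuples; larger age = older.
    warps_waiting_at_barrier: dict mapping block_id -> number of that block's
        warps currently stalled at a barrier.
    Warps from the block with the most barrier-stalled warps win; ties fall
    back to the oldest ready warp (a GTO-style tie-break, assumed here).
    """
    return max(ready_warps,
               key=lambda w: (warps_waiting_at_barrier.get(w[1], 0), w[2]))

# Block 1 has 3 warps stalled at a barrier and block 0 has none, so a ready
# warp from block 1 is chosen even though block 0's warp is older.
ready = [(0, 0, 10), (1, 1, 5), (2, 1, 3)]
chosen = pick_warp_mwf(ready, {0: 0, 1: 3})
```

Under this reading, CFF would then steer the fetch stage toward whichever warp this selection returns, so that its next instruction is ready when the warp is issued.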

References

[1]
NVIDIA Corporation. CUDA programming guide, version 3.0, 2010.
[2]
Advanced Micro Devices, Inc. ATI stream technology. http://www.amd.com/stream, 2011.
[3]
Khronos Group. OpenCL. http://www.khronos.org/opencl, 2012.
[4]
Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163--174, 2009.
[5]
Daniel Cederman and Philippas Tsigas. On dynamic load balancing on graphics processors. In Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 57--64, 2008.
[6]
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pages 44--54, 2009.
[7]
Wu-Chun Feng and Shucai Xiao. To GPU synchronize or not GPU synchronize? In Proceedings of the International Symposium on Circuits and Systems (ISCAS), pages 3801--3804, 2010.
[8]
Mark Gebhart, Daniel R Johnson, David Tarjan, Stephen W Keckler, William J Dally, Erik Lindholm, and Kevin Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 235--246, 2011.
[9]
Ziyu Guo, Bo Wu, and Xipeng Shen. One stone two birds: Synchronization relaxation and redundancy removal in GPU-CPU translation. In Proceedings of the International Conference on Supercomputing (ICS), pages 25--36, 2012.
[10]
Ziyu Guo, Eddy Zheng Zhang, and Xipeng Shen. Correctly treating synchronizations in compiling fine-grained SPMD-threaded programs for CPU. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 310--319, 2011.
[11]
Bingsheng He, Wenbin Fang, Qiong Luo, Naga K Govindaraju, and Tuyong Wang. Mars: A MapReduce framework on graphics processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 260--269, 2008.
[12]
Wenhao Jia, Kelly A Shaw, and Margaret Martonosi. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 272--283, 2014.
[13]
Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R Das. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 395--406, 2013.
[14]
Adwait Jog, Onur Kayiran, Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R Das. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 332--343, 2013.
[15]
Onur Kayıran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 157--166, 2013.
[16]
Nagesh B Lakshminarayana and Hyesoon Kim. Effect of instruction fetch and memory scheduling on GPU performance. In Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
[17]
Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, and Soojung Ryu. Improving GPGPU resource utilization through alternative thread block scheduling. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 260--271, 2014.
[18]
Shin-Ying Lee, Akhil Arunkumar, and Carole-Jean Wu. CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 515--527, 2015.
[19]
Shin-Ying Lee and Carole-Jean Wu. CAWS: Criticality-aware warp scheduling for GPGPU workloads. In Proceedings of the International Conference on Parallel Architectures and Compilation (PACT), pages 175--186, 2014.
[20]
Dong Li, Minsoo Rhu, Daniel R Johnson, Mike O'Connor, Mattan Erez, Doug Burger, Donald S Fussell, and Stephen W Keckler. Priority-based cache allocation in throughput processors. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 1--12, 2015.
[21]
Jiwei Liu, Jun Yang, and Rami Melhem. SAWS: Synchronization aware GPGPU warp scheduling for multiple independent warp schedulers. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 383--394, 2015.
[22]
Jiayuan Meng, David Tarjan, and Kevin Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 235--246, 2010.
[23]
Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N Patt. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 308--317, 2011.
[24]
NVIDIA Corporation. CUDA SDK code samples.
[25]
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. ELF: Maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Article 8, 2015.
[26]
Timothy G Rogers, Mike O'Connor, and Tor M Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the IEEE International Symposium on Microarchitecture (MICRO), pages 72--83, 2012.
[27]
Timothy G Rogers, Mike O'Connor, and Tor M Aamodt. Divergence-aware warp scheduling. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 99--110, 2013.
[28]
Ankit Sethia, D Anoushe Jamshidi, and Scott Mahlke. Mascar: Speeding up GPU warps by reducing memory pitstops. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 174--185, 2015.
[29]
Inderpreet Singh, Arrvindh Shriraman, Wilson Fung, Mike O'Connor, and Tor Aamodt. Cache coherence for GPU architectures. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 578--590, 2013.
[30]
John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report IMPACT-12-01, Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign (UIUC), 2012.
[31]
Ping Xiang, Yi Yang, and Huiyang Zhou. Warp-level divergence in GPUs: Characterization, impact, and mitigation. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 284--295, 2014.
[32]
Shucai Xiao and Wu-Chun Feng. Inter-block GPU communication via fast barrier synchronization. In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--12, 2010.
[33]
Ayse Yilmazer and David Kaeli. HQL: A scalable synchronization mechanism for GPUs. In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS), pages 475--486, 2013.


      Published In

      ICS '16: Proceedings of the 2016 International Conference on Supercomputing
      June 2016
      547 pages
      ISBN: 9781450343619
      DOI: 10.1145/2925426

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%

      Cited By

      • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. In ISCA 2024, pages 978--990, June 2024. DOI: 10.1109/ISCA59077.2024.00075
      • (2024) Task Mapping and Scheduling on RISC-V MIMD Processor With Vector Accelerator Using Model-Based Parallelization. IEEE Access, 12:35779--35795, 2024. DOI: 10.1109/ACCESS.2024.3373902
      • (2023) Mitigating GPU Core Partitioning Performance Effects. In HPCA 2023, pages 530--542, February 2023. DOI: 10.1109/HPCA56546.2023.10070957
      • (2022) Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs. ACM Transactions on Architecture and Code Optimization, 19(4):1--21, September 2022. DOI: 10.1145/3547301
      • (2022) A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility. In IPDPS 2022, pages 863--874, May 2022. DOI: 10.1109/IPDPS53621.2022.00089
      • (2020) Exploring Warp Criticality in Near-Threshold GPGPU Applications Using a Dynamic Choke Point Analysis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(2):456--466, February 2020. DOI: 10.1109/TVLSI.2019.2943450
      • (2020) Thread-Level Locking for SIMT Architectures. IEEE Transactions on Parallel and Distributed Systems, 31(5):1121--1136, May 2020. DOI: 10.1109/TPDS.2019.2955705
      • (2020) Selective Replication in Memory-Side GPU Caches. In MICRO 2020, pages 967--980, October 2020. DOI: 10.1109/MICRO50266.2020.00082
      • (2019) Efficient implementation of OpenACC cache directive on NVIDIA GPUs. International Journal of High Performance Computing and Networking, 13(1):35--53, January 2019. DOI: 10.5555/3302714.3302717
      • (2019) Adaptive memory-side last-level GPU caching. In Proceedings of the 46th International Symposium on Computer Architecture, pages 411--423, June 2019. DOI: 10.1145/3307650.3322235
