Oversubscribed Command Queues in GPUs

Authors:

Sooraj Puthoor,

Bradford M. Beckmann

GPGPU-11: Proceedings of the 11th Workshop on General Purpose GPUs

Pages 50 - 60

https://doi.org/10.1145/3180270.3180271

Published: 24 February 2018

Abstract

As GPUs become larger and provide an increasing number of parallel execution units, a single kernel is no longer sufficient to utilize all available resources. As a result, GPU applications are beginning to use fine-grain asynchronous kernels, which execute in parallel and expose more concurrency. Currently, the Heterogeneous System Architecture (HSA) and Compute Unified Device Architecture (CUDA) specifications support concurrent kernel launches through multiple command queues (HSA queues and CUDA streams, respectively). At the same time, GPU hardware has reduced launch overheads, making fine-grain kernels more attractive.

Although increasing the number of command queues is good for kernel concurrency, the GPU hardware can only monitor a fixed number of queues at any given time. Therefore, if the number of command queues exceeds the hardware's monitoring capability, the queues become oversubscribed and the hardware has to service some of them sequentially. This mapping process periodically swaps between all allocated queues and limits the available concurrency to the ready kernels in the currently mapped queues. In this paper, we draw attention to the queue oversubscription challenge and demonstrate one solution, queue prioritization, which provides up to 45x speedup for the NW benchmark over a baseline that swaps queues in a round-robin fashion.
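The effect of oversubscription can be illustrated with a toy model. The sketch below is not the paper's simulator or its prioritization mechanism; it assumes one kernel per queue, a unit-time kernel, and a simple "map the ready queues first" rule as a stand-in for prioritization. The function name `simulate` and its parameters are hypothetical, chosen only for this example.

```python
def simulate(num_queues, hw_slots, deps, policy):
    """Count scheduling quanta needed to finish one kernel per queue.

    Toy assumptions (not taken from the paper):
      - each queue holds exactly one kernel that runs in one quantum;
      - deps[q] is the queue whose kernel must finish before queue q's
        kernel becomes ready, or None if it has no dependency;
      - the hardware can monitor (map) only hw_slots queues per quantum.
    """
    done = [False] * num_queues
    cursor = 0   # next queue to map under round-robin
    quanta = 0
    while not all(done):
        # Queues whose kernel is unfinished and whose dependency is satisfied.
        ready = [q for q in range(num_queues)
                 if not done[q] and (deps[q] is None or done[deps[q]])]
        if policy == "round_robin":
            # Cycle through all allocated queues, ready or not.
            mapped = [(cursor + i) % num_queues for i in range(hw_slots)]
            cursor = (cursor + hw_slots) % num_queues
        elif policy == "priority":
            # Map queues that currently have a ready kernel first.
            mapped = ready[:hw_slots]
        else:
            raise ValueError(f"unknown policy: {policy}")
        # A kernel runs only if its queue is mapped and it was ready
        # at the start of this quantum.
        for q in mapped:
            if q in ready:
                done[q] = True
        quanta += 1
    return quanta


# Four queues, one hardware slot, and a dependency chain 3 -> 2 -> 1 -> 0,
# so exactly one kernel is ready at any time.
chain = [1, 2, 3, None]
print(simulate(4, 1, chain, "round_robin"))  # 13
print(simulate(4, 1, chain, "priority"))     # 4
```

In this contrived case the round-robin mapper spends most quanta monitoring queues with no ready kernel, while the ready-first mapper never does. The gap between the two counts is a property of the toy setup and says nothing about the speedups measured in the paper.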

Index Terms

Oversubscribed Command Queues in GPUs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling

Published In

GPGPU-11: Proceedings of the 11th Workshop on General Purpose GPUs

February 2018

64 pages

ISBN: 9781450356473

DOI: 10.1145/3180270

Conference Chairs:
David Kaeli,
John Cavazos

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

PPoPP '18

Sponsor:

PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%
