DOI: 10.1145/3180270.3180271
research-article

Oversubscribed Command Queues in GPUs

Published: 24 February 2018

Abstract

As GPUs grow larger and provide more parallel execution units, a single kernel is no longer sufficient to utilize all available resources. As a result, GPU applications are beginning to use fine-grain asynchronous kernels, which execute in parallel and expose more concurrency. Currently, the Heterogeneous System Architecture (HSA) and Compute Unified Device Architecture (CUDA) specifications support concurrent kernel launches through multiple command queues (HSA queues and CUDA streams, respectively). At the same time, GPU hardware has reduced launch overheads, making fine-grain kernels more attractive.
Although increasing the number of command queues improves kernel concurrency, GPU hardware can monitor only a fixed number of queues at any given time. If the number of command queues exceeds the hardware's monitoring capability, the queues become oversubscribed and the hardware must service some of them sequentially. The resulting mapping process periodically swaps among all allocated queues, which limits the available concurrency to the ready kernels in the currently mapped queues. In this paper, we bring attention to the queue oversubscription challenge and demonstrate one solution, queue prioritization, which provides up to a 45x speedup for the NW benchmark over a baseline that swaps queues in round-robin fashion.
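The oversubscription effect described above can be illustrated with a toy model. The sketch below is not the paper's simulator or its prioritization scheme; it is a minimal Python illustration under stated assumptions: each time step the hardware maps `slots` queues, a mapped queue runs its head kernel only if that kernel's dependencies have already finished, and every kernel takes one step. The names `simulate`, `reverse_chain`, `slots`, and `prioritize` are hypothetical and chosen only for this example.

```python
from collections import deque


def simulate(queues, deps, slots, prioritize, max_steps=10_000):
    """Return the number of time steps needed to drain all queues.

    queues:     list of lists of kernel names (each queue is in-order)
    deps:       dict mapping a kernel to the set of kernels it waits on
    slots:      number of queues the modeled hardware can map at once (>= 1)
    prioritize: if True, map queues whose head kernel is ready first;
                if False, rotate through the queues round-robin
    """
    pending = [deque(q) for q in queues]
    done = set()
    cursor = 0  # round-robin position
    n = len(pending)

    def head_ready(i):
        # A head kernel is ready once all of its dependencies have finished.
        return deps.get(pending[i][0], set()) <= done

    for step in range(1, max_steps + 1):
        # Non-empty queues in rotation order, starting at the cursor.
        order = [(cursor + k) % n for k in range(n)]
        order = [i for i in order if pending[i]]
        if not order:
            return step - 1
        if prioritize:
            # Stable sort: queues with a ready head move to the front.
            order.sort(key=lambda i: not head_ready(i))
        mapped = order[:slots]
        # Only kernels that were ready at the start of the step execute.
        finished = [pending[i].popleft() for i in mapped if head_ready(i)]
        done.update(finished)
        cursor = (mapped[-1] + 1) % n
    raise RuntimeError("no progress; dependencies may be cyclic")


def reverse_chain(n):
    """Assumed workload: one kernel per queue, kernel i waits on kernel i+1."""
    queues = [[f"k{i}"] for i in range(n)]
    deps = {f"k{i}": {f"k{i + 1}"} for i in range(n - 1)}
    return queues, deps


if __name__ == "__main__":
    queues, deps = reverse_chain(8)
    print("round-robin steps:", simulate(queues, deps, slots=2, prioritize=False))
    print("prioritized steps:", simulate(queues, deps, slots=2, prioritize=True))
```

In this model the round-robin policy spends many steps on mapped queues whose head kernels are still blocked, while the prioritized policy always maps the one queue that can make progress. The size of the gap depends entirely on the assumed workload and says nothing about the speedups measured in the paper.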

Published In

GPGPU-11: Proceedings of the 11th Workshop on General Purpose GPUs
February 2018
64 pages
ISBN:9781450356473
DOI:10.1145/3180270
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PPoPP '18

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%
