Abstract
Nowadays, GPU clusters are available in almost every data processing center. Their GPUs are typically shared by different applications that may have different processing needs and/or different levels of priority. In this scenario, concurrent kernel execution can improve device utilization by co-executing kernels with different or complementary resource utilization profiles. A paramount issue in concurrent kernel execution on GPUs is obtaining a suitable distribution of streaming multiprocessor (SM) resources among co-executing kernels to fulfill different scheduling aims. In this work, we present a software scheduler, named FlexSched, that employs a low-overhead run-time mechanism to perform intra-SM allocation of cooperative thread arrays (a.k.a. thread blocks) of co-executing kernels. It also implements a productive online profiling mechanism that dynamically adjusts the resource assignment of kernels according to the instantaneous performance achieved by the co-running kernels. An important characteristic of our approach is that no off-line kernel analysis is required to establish the best resource assignment for co-located kernels. Thus, it can run in any system where new applications must be scheduled immediately. Using a set of nine applications (13 kernels), we show that our approach improves the co-execution performance of recent slicing methods: it obtains a co-execution speedup of 1.40\(\times \), while the slicing method achieves only 1.29\(\times \). In addition, we test FlexSched in a real scheduling scenario where new applications are launched as soon as GPU resources become available. In this scenario, FlexSched reduces the average overall execution time by a factor of 1.25\(\times \) with respect to the time obtained when proprietary hardware support (Hyper-Q) is employed.
Finally, FlexSched is also used to implement scheduling policies that guarantee a maximum turnaround time for latency-sensitive applications while achieving high resource utilization through kernel co-execution.
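The abstract describes an online profiling loop that redistributes intra-SM thread blocks between co-running kernels based on their instantaneous performance, with no offline analysis. The following is a minimal, hypothetical sketch of that kind of feedback loop (not the paper's actual implementation): a hill-climbing search over the per-SM block split, driven by an assumed `measure` profiling callback.

```python
def co_schedule(blocks_per_sm, measure):
    """Hill-climb the intra-SM block split between two co-running kernels.

    `measure(split)` is a hypothetical profiling callback that returns the
    combined normalized throughput of both kernels when kernel A is given
    `split` blocks per SM and kernel B receives the remaining blocks.
    Returns the best split found and its measured throughput.
    """
    split = blocks_per_sm // 2          # start from an even partition
    best = measure(split)
    while True:
        improved = False
        # Try shifting one block toward either kernel.
        for cand in (split - 1, split + 1):
            if 1 <= cand <= blocks_per_sm - 1:   # each kernel keeps >= 1 block
                perf = measure(cand)
                if perf > best:
                    best, split, improved = perf, cand, True
        if not improved:                # local optimum: keep this partition
            return split, best
```

In the real system the "measurement" would come from low-overhead run-time counters sampled while the kernels co-execute, and the chosen split would be enforced by the intra-SM block allocation mechanism.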
Acknowledgements
This work has been supported by the Junta de Andalucía of Spain (P18-FR-3130) and the Ministry of Education of Spain (PID2019-105396RB-I00). We also thank NVIDIA for hardware donations within its GPU Grant Program.
Cite this article
López-Albelda, B., Castro, F.M., González-Linares, J.M. et al. FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs. J Supercomput 78, 43–71 (2022). https://doi.org/10.1007/s11227-021-03819-z