Abstract
Nowadays, GPU clusters are available in almost every data processing center. Their GPUs are typically shared by different applications that may have different processing needs and/or different levels of priority. In this scenario, concurrent kernel execution can improve device utilization by co-executing kernels with different or complementary resource utilization profiles. A paramount issue in concurrent kernel execution on GPUs is obtaining a suitable distribution of streaming multiprocessor (SM) resources among co-executing kernels to fulfill different scheduling aims. In this work, we present a software scheduler, named FlexSched, that employs a low-overhead run-time mechanism to perform intra-SM allocation of cooperative thread arrays (a.k.a. thread blocks) of co-executing kernels. It also implements a productive online profiling mechanism that dynamically adjusts the resource assignment of kernels according to the instantaneous performance achieved by the co-running kernels. An important characteristic of our approach is that no off-line kernel analysis is required to establish the best resource assignment for co-located kernels. Thus, it can run in any system where new applications must be scheduled immediately. Using a set of nine applications (13 kernels), we show that our approach improves the co-execution performance of recent slicing methods: it obtains a co-execution speedup of 1.40\(\times \), while the slicing method achieves only 1.29\(\times \). In addition, we test FlexSched in a real scheduling scenario where new applications are launched as soon as GPU resources become available. In this scenario, FlexSched reduces the average overall execution time by a factor of 1.25\(\times \) with respect to the time obtained when proprietary hardware support (Hyper-Q) is employed.
Finally, FlexSched is also used to implement scheduling policies that guarantee a maximum turnaround time for latency-sensitive applications while achieving high resource utilization through kernel co-execution.
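The abstract describes an online profiling loop that redistributes intra-SM thread blocks between co-running kernels based on their instantaneous performance, with no offline analysis. The following is a minimal, hypothetical sketch of that kind of feedback loop (not the paper's actual implementation): a hill-climbing search over the per-SM block split, driven by an assumed `measure` profiling callback.

```python
def co_schedule(blocks_per_sm, measure):
    """Hill-climb the intra-SM block split between two co-running kernels.

    `measure(split)` is a hypothetical profiling callback that returns the
    combined normalized throughput of both kernels when kernel A is given
    `split` blocks per SM and kernel B receives the remaining blocks.
    Returns the best split found and its measured throughput.
    """
    split = blocks_per_sm // 2          # start from an even partition
    best = measure(split)
    while True:
        improved = False
        # Try shifting one block toward either kernel.
        for cand in (split - 1, split + 1):
            if 1 <= cand <= blocks_per_sm - 1:   # each kernel keeps >= 1 block
                perf = measure(cand)
                if perf > best:
                    best, split, improved = perf, cand, True
        if not improved:                # local optimum: keep this partition
            return split, best
```

In the real system the "measurement" would come from low-overhead run-time counters sampled while the kernels co-execute, and the chosen split would be enforced by the intra-SM block allocation mechanism.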
Acknowledgements
This work has been supported by the Junta de Andalucía of Spain (P18-FR-3130) and the Ministry of Education of Spain (PID2019-105396RB-I00). We also thank NVIDIA for hardware donations within its GPU Grant Program.
Cite this article
López-Albelda, B., Castro, F.M., González-Linares, J.M. et al. FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs. J Supercomput 78, 43–71 (2022). https://doi.org/10.1007/s11227-021-03819-z