
FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs

The Journal of Supercomputing

Abstract

Nowadays, GPU clusters are available in almost every data processing center. Their GPUs are typically shared by different applications that may have different processing needs and/or different priority levels. In this scenario, concurrent kernel execution can improve device utilization by co-executing kernels with different or complementary resource-utilization profiles. A paramount issue in concurrent kernel execution on GPUs is obtaining a suitable distribution of streaming multiprocessor (SM) resources among co-executing kernels to fulfill different scheduling aims. In this work, we present a software scheduler, named FlexSched, that employs a low-overhead run-time mechanism to perform intra-SM allocation of cooperative thread arrays (a.k.a. thread blocks) of co-executing kernels. It also implements a productive online profiling mechanism that dynamically changes kernel resource assignments according to the instantaneous performance achieved by co-running kernels. An important characteristic of our approach is that no off-line kernel analysis is required to establish the best resource assignment for co-located kernels; thus, it can run in any system where new applications must be scheduled immediately. Using a set of nine applications (13 kernels), we show that our approach improves the co-execution performance of recent slicing methods: it obtains a co-execution speedup of 1.40× where the slicing method achieves just 1.29×. In addition, we test FlexSched in a real scheduling scenario where new applications are launched as soon as GPU resources become available. In this scenario, FlexSched reduces the average overall execution time by a factor of 1.25× with respect to the time obtained when the proprietary hardware mechanism (Hyper-Q) is employed. Finally, FlexSched is also used to implement scheduling policies that guarantee a maximum turnaround time for latency-sensitive applications while achieving high resource utilization through kernel co-execution.
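The core idea the abstract describes — profiling co-running kernels online and adjusting their intra-SM thread-block split toward the best combined performance — can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate-split enumeration, and the synthetic throughput model below are all hypothetical stand-ins for the real run-time profiling that FlexSched performs on the GPU.

```python
# Hypothetical sketch of an online partition search in the spirit of
# FlexSched: try candidate intra-SM thread-block splits for two
# co-running kernels, "profile" each briefly, and commit to the split
# with the highest combined throughput. The throughput model here is a
# synthetic stand-in for real on-device measurements.

def profile(split, throughput_model):
    """Return combined normalized throughput for a (blocks_k1, blocks_k2) split."""
    return throughput_model(split)

def best_split(max_blocks_per_sm, throughput_model):
    # Enumerate every way to share the SM's resident blocks between
    # the two kernels (each kernel gets at least one block).
    candidates = [(b, max_blocks_per_sm - b) for b in range(1, max_blocks_per_sm)]
    # Online profiling: evaluate each candidate for a short window,
    # then keep the best-performing one.
    return max(candidates, key=lambda s: profile(s, throughput_model))

# Assumed workload mix: kernel 1 is compute-bound (throughput scales
# with its block count), kernel 2 is memory-bound (saturates at ~3 blocks).
model = lambda s: s[0] / 8 + min(s[1], 3) / 3

print(best_split(8, model))  # the compute-bound kernel gets most blocks
```

Because no off-line analysis is assumed, a scheduler like this can react when a kernel's behavior changes: re-running the search window periodically lets the block split track the instantaneous performance of the co-running kernels, as the abstract describes.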



Acknowledgements

This work has been supported by the Junta de Andalucía of Spain (P18-FR-3130) and the Ministry of Education of Spain (PID2019-105396RB-I00). We also thank Nvidia for hardware donations within its GPU Grant Program.

Corresponding author

Correspondence to Nicolás Guil.


About this article

Cite this article

López-Albelda, B., Castro, F.M., González-Linares, J.M. et al. FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs. J Supercomput 78, 43–71 (2022). https://doi.org/10.1007/s11227-021-03819-z
