Abstract
Priority-based performance isolation among DNN tasks in a GPU cluster should be enforced from the perspective of the GPUs that execute the DNNs, not from that of the CPU-side tasks that merely supervise them. In this paper, we propose gCFS, which gives each DNN a GPU occupancy proportional to its priority. gCFS inherits the CPU-side fair-share scheduling policy, thereby achieving GPU-perspective performance isolation in proportion to priorities. Its smaller scheduling granularity enables more precise control over time slices on the GPUs and queues DNN workloads more densely, reducing GPU idle time. During scheduling, the length of each DNN workload is elastically adjusted to the given time slice, and the optimal GPU is selected dynamically. Experiments with multiple concurrently running DNNs show that gCFS significantly improves priority-based performance isolation over the baseline without it, while reducing the makespan and DNN completion time by up to 40.4% and 41.8%, respectively.
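The core idea of inheriting a CPU-side fair-share (CFS-like) policy can be illustrated with a minimal sketch: each task accrues a "virtual runtime" scaled by the inverse of its priority weight, and the task with the smallest virtual runtime is dispatched next, so GPU time converges to priority-proportional shares. This is an illustrative approximation under assumed names (`DnnTask`, `pick_next`, `account`, the base weight of 1024), not the paper's actual implementation.

```python
class DnnTask:
    def __init__(self, name, weight):
        self.name = name          # task identifier
        self.weight = weight      # priority weight (larger = higher share)
        self.vruntime = 0.0       # weighted GPU time consumed so far

def pick_next(tasks):
    """Dispatch the runnable task with the smallest virtual runtime."""
    return min(tasks, key=lambda t: t.vruntime)

def account(task, gpu_time_ms, base_weight=1024):
    """Charge actual GPU time, scaled inversely by the priority weight."""
    task.vruntime += gpu_time_ms * base_weight / task.weight

# Simulate many scheduling rounds: a weight-2048 task should receive
# about twice the GPU time of a weight-1024 task.
tasks = [DnnTask("resnet", 2048), DnnTask("vgg", 1024)]
usage = {t.name: 0.0 for t in tasks}
for _ in range(300):
    t = pick_next(tasks)
    account(t, 10.0)              # each slice runs 10 ms of GPU work
    usage[t.name] += 10.0

ratio = usage["resnet"] / usage["vgg"]   # converges toward 2.0
```

Because the virtual-runtime clock of a high-weight task advances more slowly, it is selected more often; over many rounds the observed usage ratio matches the weight ratio, which is the proportional-share property the abstract describes.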
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This research was financially supported by Hansung University.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cho, H., Kim, M. gCFS: completely fair scheduling on multiple GPUs for improved multi-DNN execution in terms of performance isolation. J Supercomput 79, 5851–5877 (2023). https://doi.org/10.1007/s11227-022-04901-w