Abstract
Priority-based performance isolation among DNN tasks in a GPU cluster should be enforced from the perspective of the GPUs that execute the DNNs, not from that of the CPU-side tasks that merely supervise them. In this paper, we propose gCFS, which gives each DNN a GPU occupancy proportional to its priority. gCFS inherits the CPU-side fair-share scheduling policy, thereby achieving GPU-perspective performance isolation in proportion to priorities. Its smaller scheduling granularity enables more precise control over time slices on the GPUs and queues DNN workloads more densely, reducing GPU idle time. During scheduling, the length of each DNN workload is elastically adjusted to the given time slice, and the optimal GPU is selected dynamically. Experiments with multiple concurrently running DNNs show that gCFS significantly improves priority-based performance isolation over the baseline without it, while reducing the makespan and DNN completion time by up to 40.4% and 41.8%, respectively.
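The core idea of inheriting a CPU-side fair-share (CFS-like) policy can be illustrated with a minimal sketch: each task accrues a "virtual runtime" scaled by the inverse of its priority weight, and the task with the smallest virtual runtime is dispatched next, so GPU time converges to priority-proportional shares. This is an illustrative approximation under assumed names (`DnnTask`, `pick_next`, `account`, the base weight of 1024), not the paper's actual implementation.

```python
class DnnTask:
    def __init__(self, name, weight):
        self.name = name          # task identifier
        self.weight = weight      # priority weight (larger = higher share)
        self.vruntime = 0.0       # weighted GPU time consumed so far

def pick_next(tasks):
    """Dispatch the runnable task with the smallest virtual runtime."""
    return min(tasks, key=lambda t: t.vruntime)

def account(task, gpu_time_ms, base_weight=1024):
    """Charge actual GPU time, scaled inversely by the priority weight."""
    task.vruntime += gpu_time_ms * base_weight / task.weight

# Simulate many scheduling rounds: a weight-2048 task should receive
# about twice the GPU time of a weight-1024 task.
tasks = [DnnTask("resnet", 2048), DnnTask("vgg", 1024)]
usage = {t.name: 0.0 for t in tasks}
for _ in range(300):
    t = pick_next(tasks)
    account(t, 10.0)              # each slice runs 10 ms of GPU work
    usage[t.name] += 10.0

ratio = usage["resnet"] / usage["vgg"]   # converges toward 2.0
```

Because the virtual-runtime clock of a high-weight task advances more slowly, it is selected more often; over many rounds the observed usage ratio matches the weight ratio, which is the proportional-share property the abstract describes.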
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This research was financially supported by Hansung University.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cho, H., Kim, M. gCFS: completely fair scheduling on multiple GPUs for improved multi-DNN execution in terms of performance isolation. J Supercomput 79, 5851–5877 (2023). https://doi.org/10.1007/s11227-022-04901-w