
Job scheduling for large-scale machine learning clusters

Published: 24 November 2020

Abstract

With the rapid proliferation of Machine Learning (ML) and Deep Learning (DL) applications running on modern platforms, it is crucial to satisfy application performance requirements such as meeting deadlines and ensuring accuracy. To this end, researchers have proposed several job schedulers for ML clusters. However, none of the previously proposed schedulers considers ML model parallelism, even though it has been proposed as an approach to increase the efficiency of running large-scale ML and DL jobs. In this paper, we therefore propose an ML-job-Feature-based job Scheduling system (MLFS) for ML clusters running both data-parallelism and model-parallelism ML jobs. MLFS first uses a heuristic scheduling method that considers an ML job's spatial and temporal features to determine task priority for job queue ordering, improving job completion time (JCT) and accuracy. The decisions produced by the heuristic method are used to train a deep reinforcement learning (RL) model; once the RL model is well trained, MLFS switches to it to make job scheduling decisions automatically. Furthermore, MLFS has a system load control method that selects tasks on overloaded servers to move to underloaded servers based on task priority, and, when the system is overloaded, removes tasks that yield little or no improvement in accuracy, thereby improving JCT and the accuracy achieved by the job deadline. Real-world experiments and large-scale simulations based on real traces show that MLFS reduces JCT by up to 53% and makespan by up to 52%, and improves accuracy by up to 64% compared with existing ML job schedulers. We have also open-sourced our code.
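
The abstract describes a two-phase design: a heuristic that orders the job queue by spatial (resource demand) and temporal (remaining time, deadline) features, whose logged decisions are used to train a deep RL model that later takes over queue ordering, plus a load-control step that migrates low-priority tasks off overloaded servers and drops tasks with negligible accuracy benefit. The sketch below is a minimal illustration of that control flow only, not the paper's implementation; the priority formula, the rl_policy.score interface, the Task/Server types, and all thresholds are assumptions made for this example.

# Illustrative sketch (not MLFS's actual code): heuristic-then-RL queue
# ordering plus priority-based load control, using toy Task/Server types.
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    job_id: int
    demand: float           # spatial feature: resource share the task needs
    remaining_time: float   # temporal feature: estimated remaining run time
    deadline: float         # absolute deadline of the parent job
    accuracy_gain: float    # estimated marginal accuracy improvement


@dataclass
class Server:
    capacity: float
    tasks: List[Task] = field(default_factory=list)

    def load(self) -> float:
        return sum(t.demand for t in self.tasks) / self.capacity


def heuristic_priority(task: Task, now: float) -> float:
    """Toy priority: prefer tasks that are cheap, short, and deadline-urgent."""
    urgency = 1.0 / max(task.deadline - now, 1e-6)
    return urgency / (task.demand * task.remaining_time + 1e-6)


class TwoPhaseScheduler:
    """Order the queue with the heuristic until enough decisions have been
    logged to train an RL policy; afterwards defer to the trained policy."""

    def __init__(self, rl_policy=None, warmup_decisions: int = 10_000):
        self.rl_policy = rl_policy          # assumed to expose .score(task)
        self.warmup_decisions = warmup_decisions
        self.traces: list = []              # (task, priority) pairs for RL training

    def order_queue(self, queue: List[Task], now: float) -> List[Task]:
        if self.rl_policy is not None and len(self.traces) >= self.warmup_decisions:
            scores = [self.rl_policy.score(t) for t in queue]     # RL phase
        else:
            scores = [heuristic_priority(t, now) for t in queue]  # heuristic phase
            self.traces.extend(zip(queue, scores))                # training data
        ranked = sorted(zip(scores, queue), key=lambda pair: pair[0], reverse=True)
        return [t for _, t in ranked]


def load_control(overloaded: List[Server], underloaded: List[Server],
                 now: float, threshold: float = 0.9,
                 min_gain: float = 1e-3) -> None:
    """Move the lowest-priority tasks off overloaded servers; if a task's
    expected accuracy gain is negligible, drop it instead of migrating it."""
    for server in overloaded:
        server.tasks.sort(key=lambda t: heuristic_priority(t, now))
        while server.load() > threshold and server.tasks and underloaded:
            victim = server.tasks.pop(0)            # lowest priority first
            if victim.accuracy_gain >= min_gain:
                random.choice(underloaded).tasks.append(victim)
            # else: discard the task -- it barely improves accuracy

A driver loop would call order_queue on each scheduling event and load_control whenever server utilization crosses the threshold; in the paper these decisions target JCT, makespan, and accuracy by the deadline, which this toy priority only approximates.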

Supplementary Material

MOV File (3386367.3432588.mov)
Job Scheduling for Large-Scale Machine Learning Clusters

Published In

CoNEXT '20: Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies
November 2020
585 pages
ISBN:9781450379489
DOI:10.1145/3386367
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 November 2020

Author Tags

  1. job scheduling
  2. machine learning
  3. resource management

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • Microsoft Research Faculty Fellowship
  • AWS Machine Learning Research Awards
  • CCF

Conference

CoNEXT '20

Acceptance Rates

Overall Acceptance Rate 198 of 789 submissions, 25%

Cited By

  • (2025) Reinforcement learning-based task scheduling for heterogeneous computing in end-edge-cloud environment. Cluster Computing 28(3). DOI: 10.1007/s10586-024-04828-2. Online publication date: 21-Jan-2025
  • (2024) DeepCTS: A Deep Reinforcement Learning Approach for AI Container Task Scheduling. Proceedings of the 2024 3rd Asia Conference on Algorithms, Computing and Machine Learning, 342-347. DOI: 10.1145/3654823.3654885. Online publication date: 22-Mar-2024
  • (2024) UniSched: A Unified Scheduler for Deep Learning Training Jobs With Different User Demands. IEEE Transactions on Computers 73(6), 1500-1515. DOI: 10.1109/TC.2024.3371794. Online publication date: Jun-2024
  • (2024) RL-Based Scheduling and Placement for Deep Learning Jobs on Large-Scale GPU Clusters. 2024 International Conference on Networking, Architecture and Storage (NAS), 1-4. DOI: 10.1109/NAS63802.2024.10781368. Online publication date: 9-Nov-2024
  • (2024) Fault Tolerant Data and Model Parallel Deep Learning in Edge Computing Networks. 2024 IEEE 21st International Conference on Mobile Ad-Hoc and Smart Systems (MASS), 460-468. DOI: 10.1109/MASS62177.2024.00067. Online publication date: 23-Sep-2024
  • (2024) Cloud-Native Computing: A Survey From the Perspective of Services. Proceedings of the IEEE 112(1), 12-46. DOI: 10.1109/JPROC.2024.3353855. Online publication date: Jan-2024
  • (2024) A Survey on Scheduling Techniques in Computing and Network Convergence. IEEE Communications Surveys & Tutorials 26(1), 160-195. DOI: 10.1109/COMST.2023.3329027. Online publication date: Sep-2025
  • (2023) A survey of Kubernetes scheduling algorithms. Journal of Cloud Computing: Advances, Systems and Applications 12(1). DOI: 10.1186/s13677-023-00471-1. Online publication date: 13-Jun-2023
  • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757. Online publication date: 27-Dec-2023
  • (2023) Embracing Uncertainty for Equity in Resource Allocation in ML Training. Proceedings of the 52nd International Conference on Parallel Processing, 423-432. DOI: 10.1145/3605573.3605583. Online publication date: 7-Aug-2023
