
Job scheduling for large-scale machine learning clusters

Published: 24 November 2020

Abstract

With the rapid proliferation of Machine Learning (ML) and Deep Learning (DL) applications running on modern platforms, it is crucial to satisfy application performance requirements such as meeting deadlines and ensuring accuracy. To this end, researchers have proposed several job schedulers for ML clusters. However, none of the previously proposed schedulers considers ML model parallelism, even though it has been proposed as an approach to increase the efficiency of running large-scale ML and DL jobs. In this paper, we therefore propose an ML-job-Feature-based job Scheduling system (MLFS) for ML clusters running both data-parallelism and model-parallelism ML jobs. MLFS first uses a heuristic scheduling method that considers an ML job's spatial and temporal features to determine task priority for job queue ordering, improving job completion time (JCT) and accuracy. The decisions produced by the heuristic method are used to train a deep reinforcement learning (RL) model; once the RL model is well trained, MLFS switches to it to make job scheduling decisions automatically. Furthermore, MLFS has a system load control method that selects tasks on overloaded servers to move to underloaded servers based on task priority, and, when the system is overloaded, removes tasks that yield little or no improvement in accuracy, thereby improving JCT and the accuracy achieved by the job deadline. Real-world experiments and large-scale simulations based on real traces show that MLFS reduces JCT by up to 53% and makespan by up to 52%, and improves accuracy by up to 64% compared with existing ML job schedulers. We have also open-sourced our code.
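
The abstract describes a two-phase design: a heuristic that orders the job queue by spatial (resource demand) and temporal (remaining time, deadline) features, whose logged decisions are used to train a deep RL model that later takes over queue ordering, plus a load-control step that migrates low-priority tasks off overloaded servers and drops tasks with negligible accuracy benefit. The sketch below is a minimal illustration of that control flow only, not the paper's implementation; the priority formula, the rl_policy.score interface, the Task/Server types, and all thresholds are assumptions made for this example.

# Illustrative sketch (not MLFS's actual code): heuristic-then-RL queue
# ordering plus priority-based load control, using toy Task/Server types.
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    job_id: int
    demand: float           # spatial feature: resource share the task needs
    remaining_time: float   # temporal feature: estimated remaining run time
    deadline: float         # absolute deadline of the parent job
    accuracy_gain: float    # estimated marginal accuracy improvement


@dataclass
class Server:
    capacity: float
    tasks: List[Task] = field(default_factory=list)

    def load(self) -> float:
        return sum(t.demand for t in self.tasks) / self.capacity


def heuristic_priority(task: Task, now: float) -> float:
    """Toy priority: prefer tasks that are cheap, short, and deadline-urgent."""
    urgency = 1.0 / max(task.deadline - now, 1e-6)
    return urgency / (task.demand * task.remaining_time + 1e-6)


class TwoPhaseScheduler:
    """Order the queue with the heuristic until enough decisions have been
    logged to train an RL policy; afterwards defer to the trained policy."""

    def __init__(self, rl_policy=None, warmup_decisions: int = 10_000):
        self.rl_policy = rl_policy          # assumed to expose .score(task)
        self.warmup_decisions = warmup_decisions
        self.traces: list = []              # (task, priority) pairs for RL training

    def order_queue(self, queue: List[Task], now: float) -> List[Task]:
        if self.rl_policy is not None and len(self.traces) >= self.warmup_decisions:
            scores = [self.rl_policy.score(t) for t in queue]     # RL phase
        else:
            scores = [heuristic_priority(t, now) for t in queue]  # heuristic phase
            self.traces.extend(zip(queue, scores))                # training data
        ranked = sorted(zip(scores, queue), key=lambda pair: pair[0], reverse=True)
        return [t for _, t in ranked]


def load_control(overloaded: List[Server], underloaded: List[Server],
                 now: float, threshold: float = 0.9,
                 min_gain: float = 1e-3) -> None:
    """Move the lowest-priority tasks off overloaded servers; if a task's
    expected accuracy gain is negligible, drop it instead of migrating it."""
    for server in overloaded:
        server.tasks.sort(key=lambda t: heuristic_priority(t, now))
        while server.load() > threshold and server.tasks and underloaded:
            victim = server.tasks.pop(0)            # lowest priority first
            if victim.accuracy_gain >= min_gain:
                random.choice(underloaded).tasks.append(victim)
            # else: discard the task -- it barely improves accuracy

A driver loop would call order_queue on each scheduling event and load_control whenever server utilization crosses the threshold; in the paper these decisions target JCT, makespan, and accuracy by the deadline, which this toy priority only approximates.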

Supplementary Material

MOV File (3386367.3432588.mov)
Job Scheduling for Large-Scale Machine Learning Clusters

Published In

CoNEXT '20: Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies
November 2020
585 pages
ISBN:9781450379489
DOI:10.1145/3386367
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 November 2020

Author Tags

  1. job scheduling
  2. machine learning
  3. resource management

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • Microsoft Research Faculty Fellowship
  • AWS Machine Learning Research Awards
  • CCF

Conference

CoNEXT '20

Acceptance Rates

Overall Acceptance Rate 198 of 789 submissions, 25%

Cited By

  • (2025) Reinforcement learning-based task scheduling for heterogeneous computing in end-edge-cloud environment. Cluster Computing 28(3). DOI: 10.1007/s10586-024-04828-2. Online publication date: 21-Jan-2025
  • (2024) DeepCTS: A Deep Reinforcement Learning Approach for AI Container Task Scheduling. Proceedings of the 2024 3rd Asia Conference on Algorithms, Computing and Machine Learning, 342-347. DOI: 10.1145/3654823.3654885. Online publication date: 22-Mar-2024
  • (2024) UniSched: A Unified Scheduler for Deep Learning Training Jobs With Different User Demands. IEEE Transactions on Computers 73(6), 1500-1515. DOI: 10.1109/TC.2024.3371794. Online publication date: Jun-2024
  • (2024) RL-Based Scheduling and Placement for Deep Learning Jobs on Large-Scale GPU Clusters. 2024 International Conference on Networking, Architecture and Storage (NAS), 1-4. DOI: 10.1109/NAS63802.2024.10781368. Online publication date: 9-Nov-2024
  • (2024) Fault Tolerant Data and Model Parallel Deep Learning in Edge Computing Networks. 2024 IEEE 21st International Conference on Mobile Ad-Hoc and Smart Systems (MASS), 460-468. DOI: 10.1109/MASS62177.2024.00067. Online publication date: 23-Sep-2024
  • (2024) Cloud-Native Computing: A Survey From the Perspective of Services. Proceedings of the IEEE 112(1), 12-46. DOI: 10.1109/JPROC.2024.3353855. Online publication date: Jan-2024
  • (2024) A Survey on Scheduling Techniques in Computing and Network Convergence. IEEE Communications Surveys & Tutorials 26(1), 160-195. DOI: 10.1109/COMST.2023.3329027. Online publication date: Sep-2025
  • (2023) A survey of Kubernetes scheduling algorithms. Journal of Cloud Computing: Advances, Systems and Applications 12(1). DOI: 10.1186/s13677-023-00471-1. Online publication date: 13-Jun-2023
  • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757. Online publication date: 27-Dec-2023
  • (2023) Embracing Uncertainty for Equity in Resource Allocation in ML Training. Proceedings of the 52nd International Conference on Parallel Processing, 423-432. DOI: 10.1145/3605573.3605583. Online publication date: 7-Aug-2023
