ABSTRACT
Large-scale machine learning (ML) models are routinely trained in a distributed fashion due to their increasing complexity and data sizes. In a shared cluster that runs multiple distributed learning workloads under a parameter server framework, it is important to determine the appropriate number of concurrent workers and parameter servers for each ML workload over time, in order to minimize the average completion time and increase resource utilization. Existing schedulers for machine learning workloads rely on meticulously designed heuristics. However, because the execution environment is highly complex and dynamic, it is challenging to construct an accurate model for making online decisions. In this paper, we design an experience-driven approach that learns to manage the cluster directly from experience rather than from a mathematical model. We propose Chic, a scheduler tailored to machine learning workloads in a cluster that leverages deep reinforcement learning techniques. With our design of the state space, action space, and reward function, Chic trains a deep neural network with a modified version of the cross-entropy method to approximate the policy for assigning workers and parameter servers to future workloads based on the agent's experience. Furthermore, a simplified variant named Chic-Pair, which assigns workers and parameter servers in pairs and requires a shorter policy training time, is proposed. We compare Chic and Chic-Pair with state-of-the-art heuristics, and our results show that both reduce the average training time of machine learning workloads significantly under a wide variety of conditions.
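The abstract's core idea, using the cross-entropy method (CEM) to search for a good worker/parameter-server allocation policy, can be illustrated with a minimal sketch. The reward model below (workers speed training up until too few parameter servers bottleneck gradient aggregation, under a fixed slot budget) is a toy assumption for illustration only, not Chic's actual state/action/reward design, and the CEM here optimizes a two-dimensional allocation directly rather than neural-network weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_completion_time(w, p):
    # Toy reward: adding workers shortens training, but too few
    # parameter servers throttle gradient aggregation; allocations
    # over a 12-slot budget are infeasible. Illustrative only.
    if w < 1 or p < 1 or w + p > 12:
        return -1e6
    return -100.0 / min(w, 3 * p)

def cem(iters=30, pop=64, elite_frac=0.2):
    # Cross-entropy method: sample candidate allocations from a
    # Gaussian, keep the elite fraction by reward, and refit the
    # Gaussian to the elites until it concentrates on a good policy.
    mu = np.array([6.0, 6.0])      # mean over (workers, ps)
    sigma = np.array([3.0, 3.0])   # stddev over (workers, ps)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, 2))
        alloc = np.rint(samples).astype(int)
        rewards = np.array([neg_completion_time(w, p) for w, p in alloc])
        elite = samples[np.argsort(rewards)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-3  # noise floor avoids collapse
    return tuple(np.rint(mu).astype(int))

w, p = cem()
```

Chic applies the same sample-select-refit loop to the parameters of a deep policy network rather than to a single allocation, which is what makes a modified CEM competitive with gradient-based deep reinforcement learning for this scheduling problem.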
Chic: experience-driven scheduling in machine learning clusters