DOI: 10.1145/3326285.3329065
research-article

Chic: experience-driven scheduling in machine learning clusters

Published: 24 June 2019

ABSTRACT

Large-scale machine learning (ML) models are routinely trained in a distributed fashion, due to their increasing complexity and data sizes. In a shared cluster handling multiple distributed learning workloads with a parameter server framework, it is important to determine an adequate number of concurrent workers and parameter servers for each ML workload over time, in order to minimize the average completion time and increase resource utilization. Existing schedulers for machine learning workloads rely on meticulously designed heuristics. However, as the execution environment is highly complex and dynamic, it is challenging to construct an accurate model for making online decisions. In this paper, we design an experience-driven approach that learns to manage the cluster directly from experience rather than from a mathematical model. We propose Chic, a scheduler tailored for scheduling machine learning workloads in a cluster by leveraging deep reinforcement learning techniques. With our design of the state space, action space, and reward function, Chic trains a deep neural network with a modified version of the cross-entropy method to approximate the policy for assigning workers and parameter servers to future workloads based on the experience of the agent. Furthermore, we propose a simplified version named Chic-Pair, which shortens the training time of the policy by assigning workers and parameter servers in pairs. We compare Chic and Chic-Pair with state-of-the-art heuristics, and our results show that Chic and Chic-Pair are able to reduce the average training time significantly for machine learning workloads under a wide variety of conditions.
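The abstract names the cross-entropy method but does not describe its modification or the network behind the policy. As a rough illustration only, the sketch below applies the plain, unmodified cross-entropy method to a hypothetical toy allocation problem, using one categorical distribution per job in place of a deep neural network. The speed model, the `CHOICES` table, the function names, and all parameter values are assumptions made for this example and do not come from the paper.

```python
import random

# Hypothetical toy model (NOT from the paper): a job with `work` units runs at
# speed min(workers, 2 * servers), i.e. one parameter server feeds two workers.
CHOICES = [(w, p) for w in range(1, 5) for p in range(1, 3)]  # (workers, servers)


def avg_completion_time(alloc, works, capacity):
    """Average completion time; over-subscribed allocations get a large penalty."""
    used = sum(w + p for w, p in alloc)
    if used > capacity:
        return 1e6 + used
    return sum(work / min(w, 2 * p) for work, (w, p) in zip(works, alloc)) / len(works)


def cross_entropy_search(works, capacity, iters=30, samples=200,
                         elite_frac=0.1, smoothing=0.7, seed=0):
    """Plain cross-entropy method over one categorical distribution per job."""
    rng = random.Random(seed)
    n_jobs, n_choices = len(works), len(CHOICES)
    probs = [[1.0 / n_choices] * n_choices for _ in range(n_jobs)]  # uniform start
    best_cost, best_alloc = float("inf"), None
    for _ in range(iters):
        # 1. Sample a batch of allocations from the current distributions.
        batch = []
        for _ in range(samples):
            idx = [rng.choices(range(n_choices), weights=probs[j])[0]
                   for j in range(n_jobs)]
            cost = avg_completion_time([CHOICES[i] for i in idx], works, capacity)
            batch.append((cost, idx))
        # 2. Keep the elite samples (lowest average completion time).
        batch.sort(key=lambda item: item[0])
        elite = batch[: max(1, int(elite_frac * samples))]
        if elite[0][0] < best_cost:
            best_cost = elite[0][0]
            best_alloc = [CHOICES[i] for i in elite[0][1]]
        # 3. Move each distribution toward the empirical elite frequencies.
        for j in range(n_jobs):
            counts = [0] * n_choices
            for _, idx in elite:
                counts[idx[j]] += 1
            for k in range(n_choices):
                freq = counts[k] / len(elite)
                probs[j][k] = (1 - smoothing) * probs[j][k] + smoothing * freq
    return best_cost, best_alloc


if __name__ == "__main__":
    cost, alloc = cross_entropy_search(works=[8.0, 4.0, 2.0], capacity=12)
    print(cost, alloc)
```

The loop shows the three steps that define the method: sample candidate decisions from the current policy, keep the best-scoring fraction, and refit the policy to those elite samples. A learned scheduler of the kind the abstract describes would replace the per-job probability table with a neural network conditioned on cluster state, which this sketch does not attempt.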

Published in

IWQoS '19: Proceedings of the International Symposium on Quality of Service
June 2019
420 pages
ISBN: 9781450367783
DOI: 10.1145/3326285

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
