DOI: 10.1145/3605573.3605647
Research article, ICPP Conference Proceedings

CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel

Published: 13 September 2023

ABSTRACT

The number of parameters in deep learning (DL) models has ballooned from millions to trillions over the past decade, so these models can no longer fit entirely in limited GPU memory. Existing works offload the parameter-update stage of DL training to the CPU, leveraging CPU compute capability and memory to support training large-scale models. However, the stage running on the CPU can block the following stage running on the GPU, wasting expensive GPU cycles. We first analyze the dataflow and workflow of DL training and find that the backward stage and the parameter-update stage can be parallelized on the GPU and CPU, respectively. To this end, we present a DL-training scheduling framework, CoTrain, which allocates compute tasks and their corresponding data to the GPU and CPU and parallelizes them effectively in both a coarse-grained and a fine-grained way. In particular, the fine-grained task-partition scheme allocates a portion of the parameter-update stage to the GPU according to data-reuse distance, largely avoiding idleness of both the GPU and the CPU while reducing data movement between them. We build and evaluate CoTrain atop PyTorch with representative models. The results show that, compared to the state-of-the-art ZeRO-Offload, CoTrain achieves a 30.4% improvement in training throughput while increasing the supported model size by up to 7%, without changing the training semantics.
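To make the coarse-grained idea concrete, the sketch below shows one way to overlap the GPU backward stage with CPU-side parameter updates using standard PyTorch gradient hooks: as each parameter's gradient is produced during backward, it is streamed to a CPU worker thread that updates a CPU-resident copy of the weights. This is a minimal illustration under assumptions of our own, not the authors' implementation; the class name CpuOverlapOptimizer, the queue-based worker, and the plain SGD update are all illustrative stand-ins.

```python
# Minimal sketch (assumed design, not CoTrain's implementation) of overlapping the
# GPU backward stage with CPU-side parameter updates via per-parameter grad hooks.
import threading
from queue import Queue

import torch


class CpuOverlapOptimizer:
    """Hypothetical helper: the CPU holds master weights and applies the updates."""

    def __init__(self, model: torch.nn.Module, lr: float = 1e-3):
        self.lr = lr
        # Pinned CPU copies of all parameters (the "master" weights).
        self.cpu_params = {
            name: p.detach().cpu().pin_memory()
            for name, p in model.named_parameters()
        }
        self.queue: Queue = Queue()
        threading.Thread(target=self._cpu_update_loop, daemon=True).start()
        # Register one hook per parameter; it fires during loss.backward() as soon
        # as that parameter's gradient is computed, while backward keeps running.
        for name, p in model.named_parameters():
            p.register_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(grad):
            # Blocking device-to-host copy for simplicity; a real system would use
            # pinned buffers and asynchronous copies on a separate CUDA stream.
            self.queue.put((name, grad.detach().cpu()))
            return grad
        return hook

    def _cpu_update_loop(self):
        while True:
            name, cpu_grad = self.queue.get()
            # Plain SGD stands in for the CPU (fused) Adam used by offloading systems.
            self.cpu_params[name].add_(cpu_grad, alpha=-self.lr)
            self.queue.task_done()

    def sync(self, model: torch.nn.Module):
        # Wait for pending CPU updates, then refresh the GPU copies of the weights.
        self.queue.join()
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.copy_(self.cpu_params[name], non_blocking=True)


if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
    ).cuda()
    opt = CpuOverlapOptimizer(model, lr=1e-3)
    for _ in range(5):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()              # hooks stream gradients to the CPU worker
        opt.sync(model)              # join CPU updates, push weights back to GPU
        model.zero_grad(set_to_none=True)
```

A production offloading system would replace the blocking copies and plain SGD with pinned buffers, asynchronous transfers, and a CPU-optimized Adam, and CoTrain's fine-grained scheme additionally keeps a data-reuse-distance-selected portion of the parameter update on the GPU rather than sending everything to the CPU.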

References

1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, Vol. 16. Savannah, GA, USA, 265–283.
2. Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In FAST. 387–401.
3. Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010, Keynote, Invited and Contributed Papers. Springer, 177–186.
4. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
5. Yu Cao, Wei Bi, Meng Fang, and Dacheng Tao. 2020. Pretrained language models for dialogue generation with multiple input sources. arXiv preprint arXiv:2010.07576 (2020).
6. Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
7. Xiaoming Chen, Danny Z. Chen, and Xiaobo Sharon Hu. 2018. moDNN: Memory optimal DNN training on GPUs. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 13–18.
8. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
9. Mark Hildebrand, Jawad Khan, Sanjeev Trika, Jason Lowe-Power, and Venkatesh Akella. 2020. AutoTM: Automatic tensor movement in heterogeneous memory systems using integer linear programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 875–890.
10. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
11. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019).
12. Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, et al. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 673–688.
13. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
14. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
15. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 583–598. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
16. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020).
17. Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. 2021. Zico: Efficient GPU Memory Sharing for Concurrent DNN Training. In USENIX Annual Technical Conference. 161–175.
18. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
19. Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, and Vijay Chidambaram. 2022. Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 579–596.
20. Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
21. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
22. Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 891–905.
23. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. (2018).
24. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
25. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
26. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
27. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21).
28. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
29. Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 41–53.
30. Chenyang Zhang, Feng Zhang, Xiaoguang Guo, Bingsheng He, Xiao Zhang, and Xiaoyong Du. 2020. iMLBench: A machine learning benchmark suite for CPU-GPU integrated architectures. IEEE Transactions on Parallel and Distributed Systems 32, 7 (2020), 1740–1752.
31. Ningxin Zheng, Bin Lin, Quanlu Zhang, Lingxiao Ma, Yuqing Yang, Fan Yang, Yang Wang, Mao Yang, and Lidong Zhou. 2022. SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 213–232.
32. Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2017. Learning Transferable Architectures for Scalable Image Recognition. CoRR abs/1707.07012 (2017). arXiv:1707.07012 http://arxiv.org/abs/1707.07012

Published in

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023, 858 pages
ISBN: 9798400708435
DOI: 10.1145/3605573
Copyright © 2023 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Published: 13 September 2023
Overall Acceptance Rate: 91 of 313 submissions, 29%