DOI: 10.1145/3575693.3575721

ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

Published: 30 January 2023
Related Artifact: ElasticFlow Artifact (software), https://doi.org/10.5281/zenodo.7481637

ABSTRACT

This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify only the deep neural network (DNN) model and hyperparameters for a job, but not the number of GPUs; (ii) users specify the deadline for a job, but not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow provides performance guarantees in terms of meeting deadlines while relieving deep learning developers of tedious, low-level, and manual resource management. The characteristics of distributed training introduce two challenges. First, the training throughput scales non-linearly with the number of GPUs. Second, the scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share to capture the minimum resources a training job needs to meet its deadline, and ElasticFlow performs admission control based on it. We develop a greedy algorithm that dynamically allocates resources to admitted jobs based on diminishing returns. We apply buddy allocation to worker placement to eliminate the effect of topology. Evaluation results on a cluster of 128 GPUs show that ElasticFlow increases the number of jobs that can meet their deadlines by 1.46–7.65× compared to existing solutions.
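The abstract names three mechanisms: a per-job Minimum Satisfactory Share, admission control built on top of it, and a greedy allocator driven by diminishing returns, with buddy allocation constraining placement. The sketch below is a minimal illustration of how these pieces could fit together, assuming a simple throughput model; all names (`Job`, `min_satisfactory_share`, `admit`, `allocate_greedy`) and the power-of-two doubling heuristic are illustrative assumptions, not ElasticFlow's actual implementation.

```python
# Illustrative sketch only; names and the throughput model are assumptions,
# not taken from the ElasticFlow codebase.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    remaining_samples: float             # work left before the deadline
    time_to_deadline: float              # seconds until the deadline
    throughput: Callable[[int], float]   # samples/sec on g GPUs (scales non-linearly)


def min_satisfactory_share(job: Job, max_gpus: int) -> int:
    """Smallest power-of-two GPU count that still lets the job finish by its
    deadline. Powers of two mirror buddy allocation, which places workers in
    aligned blocks so scaling efficiency does not depend on topology."""
    required_rate = job.remaining_samples / job.time_to_deadline
    g = 1
    while g <= max_gpus:
        if job.throughput(g) >= required_rate:
            return g
        g *= 2
    return -1  # the deadline cannot be met even with the whole cluster


def admit(admitted: List[Job], new_job: Job, cluster_gpus: int) -> bool:
    """Admission control: accept the new job only if every admitted job's
    minimum satisfactory share, plus the new job's, still fits in the cluster."""
    shares = [min_satisfactory_share(j, cluster_gpus) for j in admitted + [new_job]]
    return all(s > 0 for s in shares) and sum(shares) <= cluster_gpus


def allocate_greedy(admitted: List[Job], cluster_gpus: int) -> Dict[str, int]:
    """Start every job at its minimum satisfactory share, then hand spare GPUs
    to whichever job gains the most throughput from them (diminishing returns),
    doubling allocations so each job keeps a power-of-two block."""
    alloc = {j.name: min_satisfactory_share(j, cluster_gpus) for j in admitted}
    free = cluster_gpus - sum(alloc.values())
    while free > 0:
        best_job, best_gain = None, 0.0
        for j in admitted:
            extra = alloc[j.name]  # doubling keeps the allocation a power of two
            if extra <= free:
                gain = j.throughput(alloc[j.name] + extra) - j.throughput(alloc[j.name])
                if gain > best_gain:
                    best_job, best_gain = j, gain
        if best_job is None:
            break  # no job benefits from more GPUs, or no block fits
        alloc[best_job.name] *= 2
        free = cluster_gpus - sum(alloc.values())
    return alloc
```

In this sketch, buddy allocation appears only as the power-of-two restriction on shares and increments; the system described in the paper additionally constrains where those blocks are placed in the cluster so that placement does not hurt scaling efficiency.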

