ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

ABSTRACT
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify only the deep neural network (DNN) model and hyperparameters for a job, not the number of GPUs; (ii) users specify a deadline for a job, not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow guarantees performance in terms of meeting deadlines, while freeing deep learning developers from tedious, low-level, manual resource management. Two characteristics of distributed training make this challenging. First, training throughput scales non-linearly with the number of GPUs. Second, scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share, which captures the minimum resources a training job needs to meet its deadline; ElasticFlow performs admission control based on it. A greedy algorithm then dynamically allocates resources to admitted jobs according to their diminishing returns, and buddy allocation is applied to worker placement to eliminate the effect of topology. Evaluation on a cluster of 128 GPUs shows that ElasticFlow increases the number of jobs that meet their deadlines by 1.46–7.65× compared to existing solutions.
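The admission and allocation scheme summarized above can be illustrated with a small Python sketch. This is our own simplification, not the paper's implementation: the function names, the per-job throughput tables (profiled at power-of-two GPU counts so that shares compose with buddy placement), and the deadline units are all illustrative assumptions. It admits a job only if its minimum satisfactory share still fits in the cluster, then spends spare GPUs where the marginal throughput gain per extra GPU is largest.

```python
def min_satisfactory_share(remaining_iters, deadline, tput):
    """Smallest profiled GPU count whose throughput finishes the remaining
    iterations before the deadline, or None if no profiled share can meet
    it. tput[g] = measured iterations/second on g GPUs."""
    for g in sorted(tput):
        if remaining_iters / tput[g] <= deadline:
            return g
    return None


def greedy_allocate(jobs, total_gpus):
    """Admit jobs whose minimum satisfactory share still fits, then hand
    spare GPUs to the job with the largest marginal throughput gain per
    extra GPU (diminishing returns). Shares stay powers of two, so they
    map directly onto buddy-allocated placement blocks."""
    alloc = {}
    for name, job in jobs.items():
        share = min_satisfactory_share(job["iters"], job["deadline"], job["tput"])
        if share is not None and share <= total_gpus - sum(alloc.values()):
            alloc[name] = share  # admitted; infeasible jobs are rejected up front
    while True:
        spare = total_gpus - sum(alloc.values())
        best, best_gain = None, 0.0
        for name, g in alloc.items():
            tput = jobs[name]["tput"]
            # doubling a share of g GPUs costs g additional GPUs
            if 2 * g in tput and g <= spare:
                gain = (tput[2 * g] - tput[g]) / g  # throughput gain per extra GPU
                if gain > best_gain:
                    best, best_gain = name, gain
        if best is None:
            return alloc
        alloc[best] *= 2
```

For example, with two hypothetical jobs whose minimum satisfactory shares are 4 GPUs each, an 8-GPU cluster admits both at exactly their minimum shares; a 16-GPU cluster spends the spare capacity doubling whichever job gains more throughput per additional GPU, and a 4-GPU cluster rejects the second job at admission time rather than letting it miss its deadline.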