DOI: 10.1145/3575693.3575721

ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

Published: 30 January 2023
Related Artifact: ElasticFlow Artifact (software), https://doi.org/10.5281/zenodo.7481637

ABSTRACT

This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify only the deep neural network (DNN) model and hyperparameters for a job, but not the number of GPUs; (ii) users specify the deadline for a job, but not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow provides performance guarantees in terms of meeting deadlines while relieving deep learning developers of tedious, low-level, and manual resource management. The characteristics of distributed training introduce two challenges. First, the training throughput scales non-linearly with the number of GPUs. Second, the scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share to capture the minimum resources a training job needs to meet its deadline, and ElasticFlow performs admission control based on it. We develop a greedy algorithm that dynamically allocates resources to admitted jobs based on diminishing returns. We apply buddy allocation to worker placement to eliminate the effect of topology. Evaluation results on a cluster of 128 GPUs show that ElasticFlow increases the number of jobs that can meet their deadlines by 1.46–7.65× compared to existing solutions.
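The abstract names three mechanisms: a per-job Minimum Satisfactory Share, admission control built on top of it, and a greedy allocator driven by diminishing returns, with buddy allocation constraining placement. The sketch below is a minimal illustration of how these pieces could fit together, assuming a simple throughput model; all names (`Job`, `min_satisfactory_share`, `admit`, `allocate_greedy`) and the power-of-two doubling heuristic are illustrative assumptions, not ElasticFlow's actual implementation.

```python
# Illustrative sketch only; names and the throughput model are assumptions,
# not taken from the ElasticFlow codebase.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    remaining_samples: float             # work left before the deadline
    time_to_deadline: float              # seconds until the deadline
    throughput: Callable[[int], float]   # samples/sec on g GPUs (scales non-linearly)


def min_satisfactory_share(job: Job, max_gpus: int) -> int:
    """Smallest power-of-two GPU count that still lets the job finish by its
    deadline. Powers of two mirror buddy allocation, which places workers in
    aligned blocks so scaling efficiency does not depend on topology."""
    required_rate = job.remaining_samples / job.time_to_deadline
    g = 1
    while g <= max_gpus:
        if job.throughput(g) >= required_rate:
            return g
        g *= 2
    return -1  # the deadline cannot be met even with the whole cluster


def admit(admitted: List[Job], new_job: Job, cluster_gpus: int) -> bool:
    """Admission control: accept the new job only if every admitted job's
    minimum satisfactory share, plus the new job's, still fits in the cluster."""
    shares = [min_satisfactory_share(j, cluster_gpus) for j in admitted + [new_job]]
    return all(s > 0 for s in shares) and sum(shares) <= cluster_gpus


def allocate_greedy(admitted: List[Job], cluster_gpus: int) -> Dict[str, int]:
    """Start every job at its minimum satisfactory share, then hand spare GPUs
    to whichever job gains the most throughput from them (diminishing returns),
    doubling allocations so each job keeps a power-of-two block."""
    alloc = {j.name: min_satisfactory_share(j, cluster_gpus) for j in admitted}
    free = cluster_gpus - sum(alloc.values())
    while free > 0:
        best_job, best_gain = None, 0.0
        for j in admitted:
            extra = alloc[j.name]  # doubling keeps the allocation a power of two
            if extra <= free:
                gain = j.throughput(alloc[j.name] + extra) - j.throughput(alloc[j.name])
                if gain > best_gain:
                    best_job, best_gain = j, gain
        if best_job is None:
            break  # no job benefits from more GPUs, or no block fits
        alloc[best_job.name] *= 2
        free = cluster_gpus - sum(alloc.values())
    return alloc
```

In this sketch, buddy allocation appears only as the power-of-two restriction on shares and increments; the system described in the paper additionally constrains where those blocks are placed in the cluster so that placement does not hurt scaling efficiency.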

