
ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

Published: 30 January 2023


This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify only the deep neural network (DNN) model and hyperparameters for a job, but not the number of GPUs; (ii) users specify the deadline for a job, but not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow provides performance guarantees in terms of meeting deadlines while alleviating tedious, low-level, and manual resource management for deep learning developers. The characteristics of distributed training introduce two challenges. First, the training throughput scales non-linearly with the number of GPUs. Second, the scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share to capture the resource usage of training jobs to meet deadlines, and ElasticFlow performs admission control based on it. We develop a greedy algorithm that dynamically allocates resources to admitted jobs based on diminishing returns. We apply buddy allocation to worker placement to eliminate the effect of topology. Evaluation results on a cluster of 128 GPUs show that ElasticFlow increases the number of jobs that can meet their deadlines by 1.46–7.65× compared to existing solutions.
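The admission and allocation ideas in the abstract can be illustrated with a small sketch. This is not ElasticFlow's actual algorithm, which the abstract does not spell out. It is a simplified model under stated assumptions: each job has a hypothetical throughput profile mapping a power-of-two GPU count to iterations per second, the share is held fixed until the deadline, and leftover GPUs go to whichever job gains the most throughput per extra GPU. The names `minimum_satisfactory_share`, `admit`, and `greedy_allocate`, and all numbers, are illustrative.

```python
from typing import Dict, Optional, Tuple

# Hypothetical profile: GPU count (power of two) -> iterations per second.
# Throughput grows sub-linearly with GPUs, as the abstract notes.
Profile = Dict[int, float]
# A job is (profile, remaining iterations, seconds until deadline).
Job = Tuple[Profile, float, float]


def minimum_satisfactory_share(profile: Profile, remaining_iters: float,
                               time_to_deadline: float) -> Optional[int]:
    """Smallest profiled GPU count that finishes the job by its deadline."""
    for gpus in sorted(profile):
        if profile[gpus] * time_to_deadline >= remaining_iters:
            return gpus
    return None  # No profiled allocation can meet the deadline.


def admit(jobs: Dict[str, Job], cluster_gpus: int) -> Optional[Dict[str, int]]:
    """Simplified admission control: accept the job set only if every job has
    a feasible share and the shares fit in the cluster at the same time."""
    shares: Dict[str, int] = {}
    for name, (profile, iters, deadline) in jobs.items():
        share = minimum_satisfactory_share(profile, iters, deadline)
        if share is None:
            return None
        shares[name] = share
    return shares if sum(shares.values()) <= cluster_gpus else None


def greedy_allocate(shares: Dict[str, int], profiles: Dict[str, Profile],
                    cluster_gpus: int) -> Dict[str, int]:
    """Hand out leftover GPUs by doubling the job whose next doubling adds
    the most throughput per extra GPU (a diminishing-returns heuristic)."""
    alloc = dict(shares)
    free = cluster_gpus - sum(alloc.values())
    while True:
        best, best_gain = None, 0.0
        for name, gpus in alloc.items():
            doubled = gpus * 2
            if doubled in profiles[name] and gpus <= free:
                gain = (profiles[name][doubled] - profiles[name][gpus]) / gpus
                if gain > best_gain:
                    best, best_gain = name, gain
        if best is None:
            return alloc
        free -= alloc[best]  # Doubling consumes as many GPUs as the job holds.
        alloc[best] *= 2


if __name__ == "__main__":
    profiles = {
        "a": {1: 10.0, 2: 18.0, 4: 30.0, 8: 40.0},
        "b": {1: 5.0, 2: 9.5, 4: 18.0, 8: 32.0},
    }
    jobs = {
        "a": (profiles["a"], 100_000, 4_000),
        "b": (profiles["b"], 30_000, 4_000),
    }
    shares = admit(jobs, cluster_gpus=16)
    print("minimum shares:", shares)
    if shares is not None:
        print("allocation:", greedy_allocate(shares, profiles, 16))
```

Restricting allocations to powers of two is a choice made here so that the sketch stays compatible with a buddy-style placement scheme; a real scheduler would also have to revisit shares as jobs arrive and finish.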






      Published In

      ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
      January 2023
      947 pages
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].



      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. Cluster Scheduling
      2. Distributed Deep Learning
      3. GPU Cluster
      4. Serverless Computing


      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • National Natural Science Fund for the Excellent Young Scientists Fund Program (Overseas)
      • Beijing Outstanding Young Scientist Program
      • Microsoft University Collaboration Program
      • PKU-Baidu Fund Project


      ASPLOS '23

      Acceptance Rates

      Overall Acceptance Rate 535 of 2,713 submissions, 20%



      Article Metrics

      • Downloads (last 12 months): 758
      • Downloads (last 6 weeks): 53
      Reflects downloads up to 20 Feb 2025



      Cited By

      • (2025) Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 311-325. DOI: 10.1145/3669940.3707251. Online publication date: 3-Feb-2025
      • (2025) GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads. IEEE Transactions on Parallel and Distributed Systems 36:2, pp. 168-184. DOI: 10.1109/TPDS.2024.3470074. Online publication date: Feb-2025
      • (2025) Rethinking Cost-Efficient VM Scheduling on Public Edge Platforms: A Service Provider’s Perspective. IEEE Transactions on Mobile Computing 24:3, pp. 1846-1858. DOI: 10.1109/TMC.2024.3488082. Online publication date: Mar-2025
      • (2024) When will my ML job finish? toward providing completion time estimates through predictability-centric scheduling. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, pp. 487-505. DOI: 10.5555/3691938.3691964. Online publication date: 10-Jul-2024
      • (2024) FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms. Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 443-459. DOI: 10.1145/3698038.3698548. Online publication date: 20-Nov-2024
      • (2024) Towards SLO-Compliant and Cost-Effective Serverless Computing on Emerging GPU Architectures. Proceedings of the 25th International Middleware Conference, pp. 211-224. DOI: 10.1145/3652892.3700760. Online publication date: 2-Dec-2024
      • (2024) vTrain: A Simulation Framework for Evaluating Cost-Effective and Compute-Optimal Large Language Model Training. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 153-167. DOI: 10.1109/MICRO61859.2024.00021. Online publication date: 2-Nov-2024
      • (2024) HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), pp. 1-10. DOI: 10.1109/IWQoS61813.2024.10682915. Online publication date: 19-Jun-2024
      • (2024) Paldia: Enabling SLO-Compliant and Cost-Effective Serverless Computing on Heterogeneous Hardware. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 100-113. DOI: 10.1109/IPDPS57955.2024.00018. Online publication date: 27-May-2024
      • (2024) Non-Idle Machine-Aware Worker Placement for Efficient Distributed Training in GPU Clusters. 2024 IEEE 32nd International Conference on Network Protocols (ICNP), pp. 1-11. DOI: 10.1109/ICNP61940.2024.10858582. Online publication date: 28-Oct-2024
