DOI: 10.1145/3419111.3421299

Semi-dynamic load balancing: efficient distributed learning in non-dedicated environments

Published: 12 October 2020

Abstract

Machine learning (ML) models are increasingly trained in clusters of non-dedicated workers with heterogeneous resources. In such scenarios, model training efficiency can be degraded by stragglers (workers that run much slower than others). Efficient model training requires eliminating such stragglers, yet for modern ML workloads, existing load balancing strategies are inefficient or even infeasible. In this paper, we propose a novel strategy called semi-dynamic load balancing to eliminate stragglers in distributed ML workloads. The key insight is that ML workers should be load-balanced at iteration boundaries, without intruding on intra-iteration execution. Based on this insight, we develop LB-BSP, an integrated worker coordination mechanism that adapts each worker's load to its instantaneous processing capability by right-sizing its sample batch at the synchronization barrier. We custom-design the batch sizing algorithms for CPU and GPU clusters based on their respective characteristics. LB-BSP has been implemented as a Python module for ML frameworks such as TensorFlow and PyTorch. Our EC2 deployment confirms that LB-BSP is practical, effective and lightweight, and is able to accelerate distributed training by up to 54%.
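
To make the batch right-sizing idea concrete, below is a minimal Python sketch of proportional batch re-sizing at a synchronization barrier. It is an illustration only, not the LB-BSP algorithm from the paper (which uses dedicated batch sizing designs for CPU and GPU clusters); the function name resize_batches and the throughput-proportional heuristic are assumptions made for this example.

# Minimal sketch (illustrative, not the paper's LB-BSP algorithm): at each
# synchronization barrier, re-assign per-worker batch sizes in proportion to
# the throughput each worker achieved in the previous iteration, keeping the
# global batch size unchanged.

def resize_batches(batch_sizes, iter_times, total_batch):
    """Return next-iteration batch sizes, one per worker.

    batch_sizes -- samples processed by each worker in the last iteration
    iter_times  -- wall-clock seconds each worker spent on that iteration
    total_batch -- global batch size to preserve across workers
    """
    throughputs = [b / t for b, t in zip(batch_sizes, iter_times)]  # samples/sec
    total_throughput = sum(throughputs)
    # Fractional share of the global batch that each worker can sustain.
    shares = [total_batch * tp / total_throughput for tp in throughputs]
    new_sizes = [int(s) for s in shares]  # floor to integers
    # Hand the leftover samples to the workers with the largest fractional parts.
    leftover = total_batch - sum(new_sizes)
    by_fraction = sorted(range(len(shares)),
                         key=lambda i: shares[i] - new_sizes[i], reverse=True)
    for i in by_fraction[:leftover]:
        new_sizes[i] += 1
    return new_sizes

if __name__ == "__main__":
    # Four workers, the last of which is a straggler (2x slower per sample).
    last_sizes = [64, 64, 64, 64]
    last_times = [1.0, 1.0, 1.0, 2.0]
    print(resize_batches(last_sizes, last_times, sum(last_sizes)))
    # -> [73, 73, 73, 37]: the straggler gets a smaller batch next iteration.

A real system would also need to smooth or predict per-worker throughput across iterations rather than trusting a single noisy measurement, and to clamp each batch to a minimum size; the paper's CPU- and GPU-specific batch sizing designs are what handle such concerns.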

Supplementary Material

MP4 File (p431-chen-presentation.mp4)





      Published In

      SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
      October 2020
      535 pages
      ISBN:9781450381376
      DOI:10.1145/3419111
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. distributed learning
      2. load balancing
      3. synchronization

      Qualifiers

      • Research-article

      Funding Sources

      • RGC GRF

      Conference

      SoCC '20
      SoCC '20: ACM Symposium on Cloud Computing
      October 19 - 21, 2020
      Virtual Event, USA

      Acceptance Rates

      SoCC '20 Paper Acceptance Rate 35 of 143 submissions, 24%;
      Overall Acceptance Rate 169 of 722 submissions, 23%

      Article Metrics

      • Downloads (Last 12 months): 63
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 17 Feb 2025

      Cited By

      • (2025) A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access 13, 30993-31015. DOI: 10.1109/ACCESS.2025.3535085. Online publication date: 2025.
      • (2025) Synergistic Distributed CNN Model for Protein Classification With a Collaborative BSP Synchronization Based on LSTM Prediction. Concurrency and Computation: Practice and Experience 37, 4-5. DOI: 10.1002/cpe.70025. Online publication date: 14-Feb-2025.
      • (2024) Metis. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 563-578. DOI: 10.5555/3691992.3692027. Online publication date: 10-Jul-2024.
      • (2024) Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters. Proceedings of the 25th International Middleware Conference, 299-312. DOI: 10.1145/3652892.3700767. Online publication date: 2-Dec-2024.
      • (2024) Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 499-513. DOI: 10.1145/3620665.3640375. Online publication date: 27-Apr-2024.
      • (2024) Locality-Aware and Fault-Tolerant Batching for Machine Learning on Distributed Datasets. IEEE Transactions on Cloud Computing 12, 2, 370-387. DOI: 10.1109/TCC.2024.3351716. Online publication date: Apr-2024.
      • (2024) An Analysis of Network Overhead in Distributed TinyML. 2024 IEEE/ACM Symposium on Edge Computing (SEC), 449-455. DOI: 10.1109/SEC62691.2024.00051. Online publication date: 4-Dec-2024.
      • (2024) Proactive, Accuracy-aware Straggler Mitigation in Machine Learning Clusters. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1196-1198. DOI: 10.1109/IPDPSW63119.2024.00204. Online publication date: 27-May-2024.
      • (2024) AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5238-5251. DOI: 10.1109/ICDE60146.2024.00394. Online publication date: 13-May-2024.
      • (2024) Interference-aware opportunistic job placement for shared distributed deep learning clusters. Journal of Parallel and Distributed Computing 183, 104776. DOI: 10.1016/j.jpdc.2023.104776. Online publication date: Jan-2024.
