DOI: 10.1145/3419111.3421299

Semi-dynamic load balancing: efficient distributed learning in non-dedicated environments

Published: 12 October 2020

Abstract

Machine learning (ML) models are increasingly trained in clusters of non-dedicated workers with heterogeneous resources. In such scenarios, model training efficiency can be degraded by stragglers (workers that run much slower than others). Efficient model training requires eliminating such stragglers, yet for modern ML workloads, existing load balancing strategies are inefficient or even infeasible. In this paper, we propose a novel strategy called semi-dynamic load balancing to eliminate stragglers in distributed ML workloads. The key insight is that ML workers should be load-balanced at iteration boundaries, without intruding on intra-iteration execution. Based on this insight, we develop LB-BSP, an integrated worker coordination mechanism that adapts each worker's load to its instantaneous processing capability by right-sizing its sample batch at the synchronization barrier. We custom-design the batch sizing algorithms for CPU and GPU clusters based on their respective characteristics. LB-BSP has been implemented as a Python module for ML frameworks such as TensorFlow and PyTorch. Our EC2 deployment confirms that LB-BSP is practical, effective and lightweight, and is able to accelerate distributed training by up to 54%.
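
To make the batch right-sizing idea concrete, below is a minimal Python sketch of proportional batch re-sizing at a synchronization barrier. It is an illustration only, not the LB-BSP algorithm from the paper (which uses dedicated batch sizing designs for CPU and GPU clusters); the function name resize_batches and the throughput-proportional heuristic are assumptions made for this example.

# Minimal sketch (illustrative, not the paper's LB-BSP algorithm): at each
# synchronization barrier, re-assign per-worker batch sizes in proportion to
# the throughput each worker achieved in the previous iteration, keeping the
# global batch size unchanged.

def resize_batches(batch_sizes, iter_times, total_batch):
    """Return next-iteration batch sizes, one per worker.

    batch_sizes -- samples processed by each worker in the last iteration
    iter_times  -- wall-clock seconds each worker spent on that iteration
    total_batch -- global batch size to preserve across workers
    """
    throughputs = [b / t for b, t in zip(batch_sizes, iter_times)]  # samples/sec
    total_throughput = sum(throughputs)
    # Fractional share of the global batch that each worker can sustain.
    shares = [total_batch * tp / total_throughput for tp in throughputs]
    new_sizes = [int(s) for s in shares]  # floor to integers
    # Hand the leftover samples to the workers with the largest fractional parts.
    leftover = total_batch - sum(new_sizes)
    by_fraction = sorted(range(len(shares)),
                         key=lambda i: shares[i] - new_sizes[i], reverse=True)
    for i in by_fraction[:leftover]:
        new_sizes[i] += 1
    return new_sizes

if __name__ == "__main__":
    # Four workers, the last of which is a straggler (2x slower per sample).
    last_sizes = [64, 64, 64, 64]
    last_times = [1.0, 1.0, 1.0, 2.0]
    print(resize_batches(last_sizes, last_times, sum(last_sizes)))
    # -> [73, 73, 73, 37]: the straggler gets a smaller batch next iteration.

A real system would also need to smooth or predict per-worker throughput across iterations rather than trusting a single noisy measurement, and to clamp each batch to a minimum size; the paper's CPU- and GPU-specific batch sizing designs are what handle such concerns.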

Supplementary Material

MP4 File (p431-chen-presentation.mp4)





      Published In

      SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
      October 2020
      535 pages
      ISBN:9781450381376
      DOI:10.1145/3419111
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. distributed learning
      2. load balancing
      3. synchronization

      Qualifiers

      • Research-article

      Funding Sources

      • RGC GRF

      Conference

      SoCC '20
      SoCC '20: ACM Symposium on Cloud Computing
      October 19 - 21, 2020
      Virtual Event, USA

      Acceptance Rates

      SoCC '20 Paper Acceptance Rate 35 of 143 submissions, 24%;
      Overall Acceptance Rate 169 of 722 submissions, 23%

      Article Metrics

      • Downloads (Last 12 months): 63
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 17 Feb 2025

      Cited By

      • (2025) A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access 13, 30993-31015. DOI: 10.1109/ACCESS.2025.3535085. Online publication date: 2025.
      • (2025) Synergistic Distributed CNN Model for Protein Classification With a Collaborative BSP Synchronization Based on LSTM Prediction. Concurrency and Computation: Practice and Experience 37, 4-5. DOI: 10.1002/cpe.70025. Online publication date: 14-Feb-2025.
      • (2024) Metis. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 563-578. DOI: 10.5555/3691992.3692027. Online publication date: 10-Jul-2024.
      • (2024) Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters. Proceedings of the 25th International Middleware Conference, 299-312. DOI: 10.1145/3652892.3700767. Online publication date: 2-Dec-2024.
      • (2024) Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 499-513. DOI: 10.1145/3620665.3640375. Online publication date: 27-Apr-2024.
      • (2024) Locality-Aware and Fault-Tolerant Batching for Machine Learning on Distributed Datasets. IEEE Transactions on Cloud Computing 12, 2, 370-387. DOI: 10.1109/TCC.2024.3351716. Online publication date: Apr-2024.
      • (2024) An Analysis of Network Overhead in Distributed TinyML. 2024 IEEE/ACM Symposium on Edge Computing (SEC), 449-455. DOI: 10.1109/SEC62691.2024.00051. Online publication date: 4-Dec-2024.
      • (2024) Proactive, Accuracy-aware Straggler Mitigation in Machine Learning Clusters. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1196-1198. DOI: 10.1109/IPDPSW63119.2024.00204. Online publication date: 27-May-2024.
      • (2024) AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5238-5251. DOI: 10.1109/ICDE60146.2024.00394. Online publication date: 13-May-2024.
      • (2024) Interference-aware opportunistic job placement for shared distributed deep learning clusters. Journal of Parallel and Distributed Computing 183, 104776. DOI: 10.1016/j.jpdc.2023.104776. Online publication date: Jan-2024.
