Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters

  • Regular Paper
  • Published in CCF Transactions on High Performance Computing

Abstract

Distributed deep neural network (DDNN) training becomes increasingly compelling as DNN models grow more complex and datasets grow larger. Through an in-depth analysis of the latest Microsoft GPU cluster trace, we show that the co-located Parameter Server (PS) configuration is not uncommon in production DDNN training clusters, which inevitably causes intense network resource contention between the co-located PS and worker tasks. Our motivating experiments on Amazon EC2 further show that such network resource contention brings severe performance variation to DDNN training jobs. While existing works largely mitigate inter-job network resource contention, the intra-job (i.e., task-level) network resource contention between co-located PS and worker tasks has received comparably little attention. To tackle such performance issues, in this paper we design and implement Nebula, a Network bandwidth resource allocation strategy for DDNN training tasks, in order to mitigate the network resource contention and alleviate the performance variation of DDNN training jobs. Nebula monitors the weights of the co-located PS and worker tasks and rations the network bandwidth between the two tasks according to their respective weights. We implement a prototype of Nebula and conduct extensive experiments with representative DNN models trained on Amazon EC2. Our experiment results demonstrate that Nebula can reduce the iteration time of a DDNN training job by up to 25% and improve cluster resource utilization by up to 30% in comparison to MXNet, yet with practically acceptable runtime overhead.
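To make the rationing idea above concrete, the following minimal sketch splits a node's outbound bandwidth between a co-located PS task and worker task in proportion to their observed weights. It is an illustration only, not Nebula's actual implementation: the function name, the even-split fallback, and the 10 Gbps node bandwidth are our own assumptions.

    # Minimal sketch (not the authors' implementation): weight-proportional
    # bandwidth rationing between a co-located PS task and a worker task.
    # The function name, fallback policy, and bandwidth figure are assumptions.

    def ration_bandwidth(ps_weight: float, worker_weight: float,
                         node_bandwidth_gbps: float) -> dict:
        """Split a node's outbound bandwidth between the co-located PS and
        worker tasks in proportion to their observed task weights."""
        total = ps_weight + worker_weight
        if total == 0:
            # No demand observed yet: fall back to an even split.
            return {"ps": node_bandwidth_gbps / 2,
                    "worker": node_bandwidth_gbps / 2}
        return {
            "ps": node_bandwidth_gbps * ps_weight / total,
            "worker": node_bandwidth_gbps * worker_weight / total,
        }

    if __name__ == "__main__":
        # Example: the PS task currently carries twice the traffic weight
        # of the worker task on a 10 Gbps node.
        print(ration_bandwidth(ps_weight=2.0, worker_weight=1.0,
                               node_bandwidth_gbps=10.0))
        # {'ps': 6.666..., 'worker': 3.333...}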

Notes

  1. We consider the iteration time as the difference between the end times of the pull operations of two adjacent iterations (Zhang et al. 2017); see the formula after these notes.

  2. https://github.com/bytedance/byteps.
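
Written out explicitly (the symbols below are our own shorthand, not notation taken from the paper), the definition in Note 1 reads:

    % Iteration time of iteration k: the gap between the end times of the
    % pull operations of iterations k-1 and k.
    T_{\mathrm{iter}}^{(k)} = t_{\mathrm{pull}}^{(k)} - t_{\mathrm{pull}}^{(k-1)}

where t_pull^(k) denotes the time at which the pull operation of iteration k completes.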

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. Proc. USENIX OSDI 2016, 265–283 (2016)

  • Berral, J.L., Wang, C., Youssef, A.: AI4DL: mining behaviors of deep learning workloads for resource management. In: Proceedings of USENIX HotCloud (2020)

  • Chen, C., Wang, W., Li, B.: Round-robin synchronization: mitigating communication bottlenecks in parameter servers. Proc. IEEE INFOCOM 2019, 532–540 (2019)

  • Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems (2015). arXiv preprint arXiv:1512.01274

  • Gu, J., Chowdhury, M., Shin, K.G., Zhu, Y., Jeon, M., Qian, J., Liu, H., Guo, C.: Tiresias: a GPU cluster manager for distributed deep learning. Proc. USENIX NSDI 2019, 485–500 (2019)

  • Guo, J., Liu, F., Lui, J.C.S., Jin, H.: Fair network bandwidth allocation in IaaS datacenters via a cooperative game approach. IEEE/ACM Trans. Netw. 24, 873–886 (2015)

  • Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. Proc. ICML 2015, 1737–1746 (2015)

  • Huang, X.S., Chen, A., Ng, T.: Green, yellow, yield: end-host traffic scheduling for distributed deep learning with TensorLights. Proc. IEEE IPDPSW 2019, 430–437 (2019)

  • Jayarajan, A., Wei, J., Gibson, G., Fedorova, A., Pekhimenko, G.: Priority-based parameter propagation for distributed DNN training. In: Talwalkar, A., Smith, V., Zaharia, M. (eds.) Proceedings of Machine Learning and Systems 2019, vol. 3, pp. 132–145 (2019)

  • Jeon, M., Venkataraman, S., Phanishayee, A., Qian, J., Xiao, W., Yang, F.: Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. Proc. USENIX ATC 2019, 947–960 (2019)

  • Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., Guo, C.: A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In: Proceedings of USENIX OSDI, pp. 463–479 (2020)

  • Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Proc. NIPS 2012, 1097–1105 (2012)

  • Mahajan, K., Balasubramanian, A., Singhvi, A., Venkataraman, S., Akella, A., Phanishayee, A., Chawla, S.: Themis: fair and efficient GPU cluster scheduling. Proc. USENIX NSDI 2020, 289–304 (2020)

  • Lin, Y., Han, S., Mao, H., Wang, Y., Dally, W.J.: Deep gradient compression: reducing the communication bandwidth for distributed training (2017). arXiv preprint arXiv:1712.01887

  • Luo, L., Nelson, J., Ceze, L., Phanishayee, A., Krishnamurthy, A.: Parameter hub: a rack-scale parameter server for distributed deep neural network training. Proc. ACM SOCC 2018, 41–54 (2018)

  • Luo, L., West, P., Krishnamurthy, A., Ceze, L., Nelson, J.: PLink: discovering and exploiting datacenter network locality for efficient cloud-based distributed training. In: Proceedings of MLSys (2020)

  • Mai, L., Hong, C., Costa, P.: Optimizing network performance in distributed machine learning. In: Proceedings of USENIX HotCloud (2015)

  • Mayer, R., Jacobsen, H.A.: Scalable deep learning on distributed infrastructures: challenges, techniques, and tools. ACM Comput. Surv. 53, 1–37 (2020)

  • Mirhoseini, A., Pham, H., Le, Q.V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., Dean, J.: Device placement optimization with reinforcement learning. Proc. ICML 2017, 2430–2439 (2017)

  • Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N.R., Ganger, G.R., Gibbons, P.B., Zaharia, M.: PipeDream: generalized pipeline parallelism for DNN training. Proc. ACM SOSP 2019, 1–15 (2019)

  • Panayiotou, T., Manousakis, K., Chatzis, S.P., Ellinas, G.: A data-driven bandwidth allocation framework with QoS considerations for EONs. J. Lightwave Technol. 37, 1853–1864 (2019)

  • Peng, Y., Bao, Y., Chen, Y., Wu, C., Guo, C.: Optimus: an efficient dynamic resource scheduler for deep learning clusters. Proc. EuroSys 2018, 1–14 (2018)

  • Peng, Y., Zhu, Y., Chen, Y., Bao, Y., Yi, B., Lan, C., Wu, C., Guo, C.: A generic communication scheduler for distributed DNN training acceleration. Proc. ACM SOSP 2019, 16–29 (2019)

  • Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, New York (2020)

  • Shen, D., Luo, J., Dong, F., Jin, J., Zhang, J., Shen, J.: Facilitating application-aware bandwidth allocation in the cloud with one-step-ahead traffic information. IEEE Trans. Serv. Comput. 13, 381–394 (2020)

  • Shi, S., Chu, X., Li, B.: MG-WFBP: efficient data communication for distributed synchronous SGD algorithms. Proc. IEEE INFOCOM 2019, 172–180 (2019)

  • Shi, S., Wang, Q., Chu, X., Li, B., Qin, Y., Liu, R., Zhao, X.: Communication-efficient distributed deep learning with merged gradient sparsification on GPUs. In: Proceedings of IEEE INFOCOM (2020)

  • Ukidave, Y., Li, X., Kaeli, D.: Mystic: predictive scheduling for GPU-based cloud servers using machine learning. Proc. IEEE IPDPS 2016, 353–362 (2016)

  • Wang, C., Zhang, S., Chen, Y., Qian, Z., Wu, J., Xiao, M.: Joint configuration adaptation and bandwidth allocation for edge-based real-time video analytics. Proc. IEEE INFOCOM 2020, 1–10 (2020)

  • Wang, Q., Shi, S., Wang, C., Chu, X.: Communication contention aware scheduling of multiple deep learning training jobs (2020b). arXiv preprint arXiv:2002.10105

  • Wang, S., Li, D., Geng, J.: Geryon: accelerating distributed CNN training by network-level flow scheduling. Proc. IEEE INFOCOM 2020, 1678–1687 (2020)

  • Xu, F., Ye, W., Liu, Y., Zhang, W.: Ufalloc: towards utility max-min fairness of bandwidth allocation for applications in datacenter networks. Mobile Netw. Appl. 22, 161–173 (2017)

  • Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., Xing, E.P.: Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters. Proc. USENIX ATC 2017, 181–193 (2017)

Acknowledgements

This work was supported in part by the NSFC under grant No.61972158, in part by the Science and Technology Commission of Shanghai Municipality under grant No.20511102802 and No.18DZ2270800, and in part by the Tencent Corporation. Li Chen’s work was supported by a grant from BoRSF-RCS under the contract LEQSF(2019-22)-RD-A-21. Zhi Zhou’s work was supported in part by the NSFC under grant No.61802449.

Author information

Corresponding author

Correspondence to Fei Xu.

About this article

Cite this article

Qi, Q., Xu, F., Chen, L. et al. Rationing bandwidth resources for mitigating network resource contention in distributed DNN training clusters. CCF Trans. HPC 3, 171–185 (2021). https://doi.org/10.1007/s42514-021-00064-x
