
CEFS: compute-efficient flow scheduling for iterative synchronous applications

Published: 24 November 2020 (DOI: 10.1145/3386367.3431307)

Abstract

Iterative Synchronous Applications (ISApps), typified by distributed deep learning (DL) training, are popular in today's data centers. In ISApps, multiple nodes carry out a computing task iteratively and globally synchronize their results in every iteration. To increase the scaling efficiency of ISApps, in this paper we propose a new flow scheduling approach called CEFS. CEFS reduces the waiting time of computing nodes in two ways: within a single node, flows carrying data that can trigger earlier computation at the node are assigned higher priority; across nodes, flows destined to slower nodes are assigned higher priority.
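
As a rough illustration of the two priority dimensions described above (not the paper's implementation; `Flow`, `compute_order`, and `node_progress` are hypothetical names introduced only for this sketch), the following Python snippet ranks flows so that, across nodes, flows headed to slower workers come first, and within a node, flows whose data unblocks earlier computation come first.

```python
# Illustrative sketch, not CEFS itself: rank flows along the two dimensions
# described in the abstract. Smaller keys sort first, i.e. higher priority.
from dataclasses import dataclass

@dataclass
class Flow:
    tensor_id: int      # which parameter/gradient tensor the flow carries
    dst_node: str       # destination worker
    compute_order: int  # position at which the tensor is consumed (e.g., layer index)

def rank_flows(flows, node_progress):
    """Return flows sorted highest-priority first.

    node_progress is a hypothetical map from node -> fraction of the current
    iteration already finished; a lower value marks a slower, more urgent node.
    """
    return sorted(
        flows,
        key=lambda f: (node_progress[f.dst_node],  # inter-node: help stragglers first
                       f.compute_order))           # intra-node: unblock earliest compute

# Example: worker "w1" lags behind "w0", so its flow jumps ahead of both of w0's.
flows = [Flow(0, "w0", 0), Flow(1, "w0", 1), Flow(0, "w1", 0)]
print(rank_flows(flows, {"w0": 0.8, "w1": 0.3}))
```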
To address the challenges of realizing CEFS in real systems, e.g., the limited number of priority queues on commodity switches, the combination of the two types of priorities, and adaptation to different applications and hardware environments, we design an online Bayesian-optimization-based priority assignment algorithm that satisfies a two-dimensional order-preserving rule.
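
To make the order-preserving idea concrete, here is a toy sketch that maps a pair of normalized ranks onto a small number of switch priority queues through a monotone weighted sum. The weight `alpha` and the bucket boundaries are hand-picked placeholders standing in for the knobs the paper tunes online with Bayesian optimization, which is not reproduced here.

```python
# Toy sketch of a two-dimension order-preserving mapping onto a limited number
# of switch priority queues (commodity switches typically expose around eight).
import bisect

NUM_QUEUES = 8

def to_queue(intra_rank, inter_rank, alpha=0.5, boundaries=None):
    """Map (intra-node rank, inter-node rank), both normalized to [0, 1],
    to a queue index in [0, NUM_QUEUES); lower index = higher priority.

    A weighted sum with positive weights is monotone in both dimensions, so if
    flow A is at least as urgent as flow B on both ranks, A never lands in a
    lower-priority queue than B -- the order-preserving property.
    """
    if boundaries is None:
        boundaries = [i / NUM_QUEUES for i in range(1, NUM_QUEUES)]
    score = alpha * intra_rank + (1 - alpha) * inter_rank
    return bisect.bisect_right(boundaries, score)

print(to_queue(0.1, 0.2))  # urgent on both dimensions -> low queue index (high priority)
print(to_queue(0.9, 0.8))  # non-urgent on both -> high queue index (low priority)
```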
We implement a CEFS prototype and evaluate it both on a 16-node GPU/RoCEv2 testbed, by training typical DL models, and through NS-3 simulations. Compared with TensorFlow and two representative scheduling solutions, TicTac and ByteScheduler, CEFS improves training throughput by up to 253%, 252%, and 47%, respectively. Moreover, the scaling efficiency of the 16-node system under TensorFlow, TicTac, ByteScheduler, and CEFS is 26.6%~46.9%, 26.7%~47.0%, 63.9%~80.3%, and 92.9%~94.7%, respectively. The NS-3 simulation results show that CEFS achieves similar scaling efficiency even at larger scales.

Supplementary Material

MP4 File (3386367.3431307.mp4)
CEFS presentation

References

[1]
2016. NS3-RDMA. https://github.com/bobzhuyb/ns3-rdma. (2016).
[2]
2017. Baidu-allreduce. https://github.com/baidu-research/baidu-allreduce. (2017).
[3]
2017. Horovod. https://github.com/horovod/horovod. (2017).
[4]
2019. bytedance/byteps. https://github.com/bytedance/byteps/tree/bytescheduler. (2019).
[5]
2019. TensorFlow Benchmarks. https://github.com/tensorflow/benchmarks. (2019).
[6]
2019. xldrx/tictac. https://github.com/xldrx/tictac. (2019).
[7]
2020. NCCL. https://developer.nvidia.com/nccl. (2020).
[8]
2020. NVLink. https://www.nvidia.com/en-us/data-center/nvlink/. (2020).
[9]
Dan Alistarh, Demjan Grubic, Jerry Li, and et al. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NIPS. 1709--1720.
[10]
Mohammad Alizadeh, Shuang Yang, Milad Sharif, and et al. 2013. pFabric: minimal near-optimal datacenter transport. In SIGCOMM. 435--446.
[11]
Wei Bai, Kai Chen, Shuihai Hu, and et al. 2017. Congestion control for high-speed extremely shallow-buffered datacenter networks. In APNet Workshop. 29--35.
[12]
Wei Bai, Li Chen, Kai Chen, and et al. 2015. Information-agnostic flow scheduling for commodity data centers. In NSDI. 455--468.
[13]
Rajarshi Biswas, Xiaoyi Lu, and Dhabaleswar K Panda. 2018. Accelerating tensorflow with adaptive rdma-based grpc. In HiPC. 2--11.
[14]
Chen Chen, Qizhen Weng, Wei Wang, and et al. 2018. Fast Distributed Deep Learning via Worker-adaptive Batch Sizing. arXiv:1806.02508 (2018).
[15]
Li Chen, Justinas Lingys, Kai Chen, and Feng Liu. 2018. Auto: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization. In SIGCOMM. 191--205.
[16]
Tianqi Chen, Mu Li, Yutian Li, and et al. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274 (2015).
[17]
Mosharaf Chowdhury and Ion Stoica. 2012. Coflow: A Networking Abstraction for Cluster Applications. In HotNets. 31--36.
[18]
Mosharaf Chowdhury and Ion Stoica. 2015. Efficient Coflow Scheduling Without Prior Knowledge. In SIGCOMM. 393--406.
[19]
Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2015. Efficient coflow scheduling with Varys. In SIGCOMM. 443--454.
[20]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2019).
[21]
Fahad R Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. 2015. Decentralized task-aware scheduling for data center networks. In SIGCOMM. 431--442.
[22]
Jianbo Dong, Zheng Cao, Tao Zhang, and et al. 2020. EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform. In HPCA. 610--622.
[23]
Jinkun Geng, Dan Li, Yang Cheng, and et al. 2018. HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. In NetAI Workshop. 1--7.
[24]
Noah Golmant, Nikita Vemuri, Zhewei Yao, and et al. 2018. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent. arXiv:1811.12941 (2018).
[25]
Priya Goyal, Piotr Dollar, Ross Girshick, and et al. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 (2017).
[26]
Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, and et al. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In NSDI. 485--500.
[27]
Chuanxiong Guo, Haitao Wu, Zhong Deng, and et al. 2016. RDMA over commodity ethernet at scale. In SIGCOMM. 202--215.
[28]
Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H. Campbell. 2019. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In SysML.
[29]
Anand Jayarajan, Jinliang Wei, Garth Gibson, and et al. 2019. Priority-based Parameter Propagation for Distributed DNN Training. In SysML.
[30]
Myeongjae Jeon, Shivaram Venkataraman, Junjie Qian, and et al. 2018. Multi-tenant GPU clusters for deep learning workloads: Analysis and implications. Technical Report. Microsoft Research.
[31]
Xianyan Jia, Shutao Song, Wei He, and et al. 2018. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv:1807.11205 (2018).
[32]
Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-Aware Distributed Parameter Servers. In SIGMOD. 463--478.
[33]
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, and et al. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 (2016).
[34]
Nouamane Laanait, Joshua Romero, Junqi Yin, and et al. 2019. Exascale Deep Learning for Scientific Inverse Problems. arXiv:1909.11150 (2019).
[35]
Mu Li, David G Andersen, Jun Woo Park, and et al. 2014. Scaling distributed machine learning with the parameter server. In OSDI. 583--598.
[36]
Ziyang Li, Wei Bai, Kai Chen, and et al. 2017. Rate-aware flow scheduling for commodity data center networks. In INFOCOM. 1--9.
[37]
Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2017. Asynchronous decentralized parallel stochastic gradient descent. arXiv:1710.06952 (2017).
[38]
Yunfeng Lu, Huaxi Gu, Xiaoshan Yu, and Krishnendu Chakrabarty. 2020. Lotus: A New Topology for Large-scale Distributed Machine Learning. JETC (2020), 1--21.
[39]
Liang Luo, Ming Liu, Jacob Nelson, and et al. 2017. Motivating in-network aggregation for distributed deep neural network training. In WAX Workshop.
[40]
Liang Luo, Jacob Nelson, Luis Ceze, and et al. 2018. Parameter Hub: A Rack-Scale Parameter Server for Distributed Deep Neural Network Training. In SoCC. 41--54.
[41]
Qinyi Luo, Jinkun Lin, Youwei Zhuo, and Xuehai Qian. 2019. Hop: Heterogeneity-aware decentralized training. In ASPLOS. 893--907.
[42]
Brendan McMahan and Daniel Ramage. 2017. Federated Learning: Collaborative Machine Learning without Centralized Training Data. https://ai.googleblog.com/2017/04/federated-learning-collaborative.html. (2017).
[43]
H. Brendan McMahan, Eider Moore, Daniel Ramage, and et al. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv:1602.05629 (2016).
[44]
Hiroaki Mikami, Hisahiro Suganuma, Yoshiki Tanaka, and et al. 2018. Massively distributed SGD: ImageNet/ResNet-50 training in a flash. arXiv:1811.05233 (2018).
[45]
Jonas Mockus. 1994. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization (1994), 347--365.
[46]
Ali Munir, Ghufran Baig, Syed M. Irteza, and et al. 2014. Friends, Not Foes: Synthesizing Existing Transport Strategies for Data Center Networks. In SIGCOMM. 491--502.
[47]
Adam Paszke, Sam Gross, Francisco Massa, and et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NIPS. 8024--8035.
[48]
Yanghua Peng, Yixin Bao, Yangrui Chen, and et al. 2018. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In EuroSys. 1--14.
[49]
Yanghua Peng, Yibo Zhu, Yangrui Chen, and et al. 2019. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In SOSP. 16--29.
[50]
Amedeo Sapio, Ibrahim Abdelaziz, Abdulla Aldilaijan, and et al. 2017. In-Network Computation is a Dumb Idea Whose Time Has Come. In HotNets.
[51]
Amedeo Sapio, Marco Canini, Chen-Yu Ho, and et al. 2019. Scaling distributed machine learning with in-network aggregation. arXiv:1903.06701 (2019).
[52]
Frank Seide, Hao Fu, Jasha Droppo, and et al. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Interspeech.
[53]
Shaohuai Shi, Xiaowen Chu, and Bo Li. 2019. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms. In INFOCOM. 172--180.
[54]
C. Szegedy, V. Vanhoucke, S. Ioffe, and et al. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR. 2818--2826.
[55]
Leslie G Valiant. 1990. A bridging model for parallel computation. Commun. ACM (1990), 103--111.
[56]
Shuai Wang, Dan Li, and Jinkun Geng. 2020. Geryon: Accelerating distributed cnn training by network-level flow scheduling. In INFOCOM. 1678--1687.
[57]
Shuai Wang, Dan Li, Jinkun Geng, and et al. 2019. Impact of network topology on the performance of DML: Theoretical analysis and practical factors. In INFOCOM. 1729--1737.
[58]
Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, and et al. 2018. Gandiva: Introspective cluster scheduling for deep learning. In OSDI. 595--610.
[59]
Fei Xu, Fangming Liu, and Hai Jin. 2016. Heterogeneity and Interference-Aware Virtual Machine Provisioning for Predictable Performance in the Cloud. TC (2016), 2470--2483.
[60]
Zhilin Yang, Zihang Dai, Yiming Yang, and et al. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237 (2019).
[61]
Bairen Yi, Jiacheng Xia, Li Chen, and Kai Chen. 2017. Towards zero copy dataflows using rdma. In SIGCOMM Posters and Demos. 28--30.
[62]
Yang You, Zhao Zhang, Cho-Jui Hsieh, and et al. 2018. ImageNet Training in Minutes. In ICPP. 1--10.
[63]
Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J Freedman. 2017. Slaq: quality-driven scheduling for distributed machine learning. In SoCC. 390--404.
[64]
Hao Zhang, Zeyu Zheng, Shizhen Xu, and et al. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In ATC. 181--193.
[65]
Shuchang Zhou, Yuxin Wu, Zekun Ni, and et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 (2016).



      Published In

      CoNEXT '20: Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies
      November 2020
      585 pages
      ISBN:9781450379489
      DOI:10.1145/3386367
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. cloud computing
      2. datacenter networks
      3. flow scheduling
      4. iterative synchronous applications

      Qualifiers

      • Research-article

      Conference

      CoNEXT '20

      Acceptance Rates

      Overall Acceptance Rate 198 of 789 submissions, 25%


      Cited By

      • (2024) US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning. IEEE Transactions on Parallel and Distributed Systems 35(1), 123-139. https://doi.org/10.1109/TPDS.2023.3331372
      • (2024) Dynamic Flow Scheduling for DNN Training Workloads in Data Centers. IEEE Transactions on Network and Service Management 21(6), 6643-6657. https://doi.org/10.1109/TNSM.2024.3450670
      • (2024) Host-driven In-Network Aggregation on RDMA. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 1051-1060. https://doi.org/10.1109/INFOCOM52122.2024.10621230
      • (2023) sRDMA: A General and Low-Overhead Scheduler for RDMA. Proceedings of the 7th Asia-Pacific Workshop on Networking, 21-27. https://doi.org/10.1145/3600061.3600082
      • (2023) Accelerating Distributed DNN Training via Transport Layer Scheduling. IEEE Transactions on Parallel and Distributed Systems 34(5), 1650-1666. https://doi.org/10.1109/TPDS.2023.3250462
      • (2023) OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning. Parallel Computing 118, 103053. https://doi.org/10.1016/j.parco.2023.103053
      • (2022) HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning. Proceedings of the 51st International Conference on Parallel Processing, 1-11. https://doi.org/10.1145/3545008.3545024
      • (2022) Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology. IEEE/ACM Transactions on Networking 30(2), 572-585. https://doi.org/10.1109/TNET.2021.3117042
      • (2022) Congestion-aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks. IEEE Transactions on Cloud Computing, 1-17. https://doi.org/10.1109/TCC.2022.3197350
      • (2022) Mercury: A Simple Transport Layer Scheduler to Accelerate Distributed DNN Training. IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, 350-359. https://doi.org/10.1109/INFOCOM48880.2022.9796820
