
CEFS: compute-efficient flow scheduling for iterative synchronous applications

Published: 24 November 2020 (DOI: 10.1145/3386367.3431307)

Abstract

Iterative Synchronous Applications (ISApps), typified by distributed deep learning (DL) training, are popular in today's data centers. In ISApps, multiple nodes carry out a computing task iteratively and globally synchronize their results in every iteration. To increase the scaling efficiency of ISApps, in this paper we propose a new flow scheduling approach called CEFS. CEFS reduces the waiting time of computing nodes in two ways: within a single node, flows carrying data that can trigger earlier computation at the node are assigned higher priority; across nodes, flows destined to slower nodes are assigned higher priority.
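
As a rough illustration of the two priority dimensions described above (not the paper's implementation; `Flow`, `compute_order`, and `node_progress` are hypothetical names introduced only for this sketch), the following Python snippet ranks flows so that, across nodes, flows headed to slower workers come first, and within a node, flows whose data unblocks earlier computation come first.

```python
# Illustrative sketch, not CEFS itself: rank flows along the two dimensions
# described in the abstract. Smaller keys sort first, i.e. higher priority.
from dataclasses import dataclass

@dataclass
class Flow:
    tensor_id: int      # which parameter/gradient tensor the flow carries
    dst_node: str       # destination worker
    compute_order: int  # position at which the tensor is consumed (e.g., layer index)

def rank_flows(flows, node_progress):
    """Return flows sorted highest-priority first.

    node_progress is a hypothetical map from node -> fraction of the current
    iteration already finished; a lower value marks a slower, more urgent node.
    """
    return sorted(
        flows,
        key=lambda f: (node_progress[f.dst_node],  # inter-node: help stragglers first
                       f.compute_order))           # intra-node: unblock earliest compute

# Example: worker "w1" lags behind "w0", so its flow jumps ahead of both of w0's.
flows = [Flow(0, "w0", 0), Flow(1, "w0", 1), Flow(0, "w1", 0)]
print(rank_flows(flows, {"w0": 0.8, "w1": 0.3}))
```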
To address the challenges of realizing CEFS in real systems, e.g., the limited number of priority queues on commodity switches, the combination of the two types of priorities, and adaptation to different applications and hardware environments, we design an online Bayesian-optimization-based priority assignment algorithm that satisfies a two-dimensional order-preserving rule.
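
To make the order-preserving idea concrete, here is a toy sketch that maps a pair of normalized ranks onto a small number of switch priority queues through a monotone weighted sum. The weight `alpha` and the bucket boundaries are hand-picked placeholders standing in for the knobs the paper tunes online with Bayesian optimization, which is not reproduced here.

```python
# Toy sketch of a two-dimension order-preserving mapping onto a limited number
# of switch priority queues (commodity switches typically expose around eight).
import bisect

NUM_QUEUES = 8

def to_queue(intra_rank, inter_rank, alpha=0.5, boundaries=None):
    """Map (intra-node rank, inter-node rank), both normalized to [0, 1],
    to a queue index in [0, NUM_QUEUES); lower index = higher priority.

    A weighted sum with positive weights is monotone in both dimensions, so if
    flow A is at least as urgent as flow B on both ranks, A never lands in a
    lower-priority queue than B -- the order-preserving property.
    """
    if boundaries is None:
        boundaries = [i / NUM_QUEUES for i in range(1, NUM_QUEUES)]
    score = alpha * intra_rank + (1 - alpha) * inter_rank
    return bisect.bisect_right(boundaries, score)

print(to_queue(0.1, 0.2))  # urgent on both dimensions -> low queue index (high priority)
print(to_queue(0.9, 0.8))  # non-urgent on both -> high queue index (low priority)
```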
We implement a CEFS prototype and evaluate it both on a 16-node GPU/RoCEv2 testbed, by training typical DL models, and through NS-3 simulations. Compared with TensorFlow and two representative scheduling solutions, TicTac and ByteScheduler, CEFS improves training throughput by up to 253%, 252%, and 47%, respectively. Moreover, the scaling efficiency of the 16-node system under TensorFlow, TicTac, ByteScheduler, and CEFS is 26.6%~46.9%, 26.7%~47.0%, 63.9%~80.3%, and 92.9%~94.7%, respectively. The NS-3 simulation results show that CEFS achieves similar scaling efficiency even at larger scales.

Supplementary Material

MP4 File (3386367.3431307.mp4)
CEFS presentation

References

[1]
2016. NS3-RDMA. https://github.com/bobzhuyb/ns3-rdma. (2016).
[2]
2017. Baidu-allreduce. https://github.com/baidu-research/baidu-allreduce. (2017).
[3]
2017. Horovod. https://github.com/horovod/horovod. (2017).
[4]
2019. bytedance/byteps. https://github.com/bytedance/byteps/tree/bytescheduler. (2019).
[5]
2019. TensorFlow Benchmarks. https://github.com/tensorflow/benchmarks. (2019).
[6]
2019. xldrx/tictac. https://github.com/xldrx/tictac. (2019).
[7]
2020. NCCL. https://developer.nvidia.com/nccl. (2020).
[8]
2020. NVLink. https://www.nvidia.com/en-us/data-center/nvlink/. (2020).
[9]
Dan Alistarh, Demjan Grubic, Jerry Li, and et al. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NIPS. 1709--1720.
[10]
Mohammad Alizadeh, Shuang Yang, Milad Sharif, and et al. 2013. pFabric: minimal near-optimal datacenter transport. In SIGCOMM. 435--446.
[11]
Wei Bai, Kai Chen, Shuihai Hu, and et al. 2017. Congestion control for high-speed extremely shallow-buffered datacenter networks. In APNet Workshop. 29--35.
[12]
Wei Bai, Li Chen, Kai Chen, and et al. 2015. Information-agnostic flow scheduling for commodity data centers. In NSDI. 455--468.
[13]
Rajarshi Biswas, Xiaoyi Lu, and Dhabaleswar K Panda. 2018. Accelerating tensorflow with adaptive rdma-based grpc. In HiPC. 2--11.
[14]
Chen Chen, Qizhen Weng, Wei Wang, and et al. 2018. Fast Distributed Deep Learning via Worker-adaptive Batch Sizing. arXiv:1806.02508 (2018).
[15]
Li Chen, Justinas Lingys, Kai Chen, and Feng Liu. 2018. Auto: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization. In SIGCOMM. 191--205.
[16]
Tianqi Chen, Mu Li, Yutian Li, and et al. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274 (2015).
[17]
Mosharaf Chowdhury and Ion Stoica. 2012. Coflow: A Networking Abstraction for Cluster Applications. In HotNets. 31--36.
[18]
Mosharaf Chowdhury and Ion Stoica. 2015. Efficient Coflow Scheduling Without Prior Knowledge. In SIGCOMM. 393--406.
[19]
Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2015. Efficient coflow scheduling with Varys. In SIGCOMM. 443--454.
[20]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2019).
[21]
Fahad R Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. 2015. Decentralized task-aware scheduling for data center networks. In SIGCOMM. 431--442.
[22]
Jianbo Dong, Zheng Cao, Tao Zhang, and et al. 2020. EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform. In HPCA. 610--622.
[23]
Jinkun Geng, Dan Li, Yang Cheng, and et al. 2018. HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. In NetAI Workshop. 1--7.
[24]
Noah Golmant, Nikita Vemuri, Zhewei Yao, and et al. 2018. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent. arXiv:1811.12941 (2018).
[25]
Priya Goyal, Piotr Dollar, Ross Girshick, and et al. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 (2017).
[26]
Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, and et al. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In NSDI. 485--500.
[27]
Chuanxiong Guo, Haitao Wu, Zhong Deng, and et al. 2016. RDMA over commodity ethernet at scale. In SIGCOMM. 202--215.
[28]
Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H. Campbell. 2019. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In SysML.
[29]
Anand Jayarajan, Jinliang Wei, Garth Gibson, and et al. 2019. Priority-based Parameter Propagation for Distributed DNN Training. In SysML.
[30]
Myeongjae Jeon, Shivaram Venkataraman, Junjie Qian, and et al. 2018. Multi-tenant GPU clusters for deep learning workloads: Analysis and implications. Technical Report. Microsoft Research.
[31]
Xianyan Jia, Shutao Song, Wei He, and et al. 2018. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv:1807.11205 (2018).
[32]
Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-Aware Distributed Parameter Servers. In SIGMOD. 463--478.
[33]
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, and et al. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 (2016).
[34]
Nouamane Laanait, Joshua Romero, Junqi Yin, and et al. 2019. Exascale Deep Learning for Scientific Inverse Problems. arXiv:1909.11150 (2019).
[35]
Mu Li, David G Andersen, Jun Woo Park, and et al. 2014. Scaling distributed machine learning with the parameter server. In OSDI. 583--598.
[36]
Ziyang Li, Wei Bai, Kai Chen, and et al. 2017. Rate-aware flow scheduling for commodity data center networks. In INFOCOM. 1--9.
[37]
Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2017. Asynchronous decentralized parallel stochastic gradient descent. arXiv:1710.06952 (2017).
[38]
Yunfeng Lu, Huaxi Gu, Xiaoshan Yu, and Krishnendu Chakrabarty. 2020. Lotus: A New Topology for Large-scale Distributed Machine Learning. JETC (2020), 1--21.
[39]
Liang Luo, Ming Liu, Jacob Nelson, and et al. 2017. Motivating in-network aggregation for distributed deep neural network training. In WAX Workshop.
[40]
Liang Luo, Jacob Nelson, Luis Ceze, and et al. 2018. Parameter Hub: A Rack-Scale Parameter Server for Distributed Deep Neural Network Training. In SoCC. 41--54.
[41]
Qinyi Luo, Jinkun Lin, Youwei Zhuo, and Xuehai Qian. 2019. Hop: Heterogeneity-aware decentralized training. In ASPLOS. 893--907.
[42]
Brendan McMahan and Daniel Ramage. 2017. Federated Learning: Collaborative Machine Learning without Centralized Training Data. https://ai.googleblog.com/2017/04/federated-learning-collaborative.html. (2017).
[43]
H. Brendan McMahan, Eider Moore, Daniel Ramage, and et al. 2016. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv:1602.05629 (2016).
[44]
Hiroaki Mikami, Hisahiro Suganuma, Yoshiki Tanaka, and et al. 2018. Massively distributed SGD: ImageNet/ResNet-50 training in a flash. arXiv:1811.05233 (2018).
[45]
Jonas Mockus. 1994. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization (1994), 347--365.
[46]
Ali Munir, Ghufran Baig, Syed M. Irteza, and et al. 2014. Friends, Not Foes: Synthesizing Existing Transport Strategies for Data Center Networks. In SIGCOMM. 491--502.
[47]
Adam Paszke, Sam Gross, Francisco Massa, and et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NIPS. 8024--8035.
[48]
Yanghua Peng, Yixin Bao, Yangrui Chen, and et al. 2018. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In EuroSys. 1--14.
[49]
Yanghua Peng, Yibo Zhu, Yangrui Chen, and et al. 2019. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In SOSP. 16--29.
[50]
Amedeo Sapio, Ibrahim Abdelaziz, Abdulla Aldilaijan, and et al. 2017. In-Network Computation is a Dumb Idea Whose Time Has Come. In HotNets.
[51]
Amedeo Sapio, Marco Canini, Chen-Yu Ho, and et al. 2019. Scaling distributed machine learning with in-network aggregation. arXiv:1903.06701 (2019).
[52]
Frank Seide, Hao Fu, Jasha Droppo, and et al. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Interspeech.
[53]
Shaohuai Shi, Xiaowen Chu, and Bo Li. 2019. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms. In INFOCOM. 172--180.
[54]
C. Szegedy, V. Vanhoucke, S. Ioffe, and et al. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR. 2818--2826.
[55]
Leslie G Valiant. 1990. A bridging model for parallel computation. Commun. ACM (1990), 103--111.
[56]
Shuai Wang, Dan Li, and Jinkun Geng. 2020. Geryon: Accelerating distributed cnn training by network-level flow scheduling. In INFOCOM. 1678--1687.
[57]
Shuai Wang, Dan Li, Jinkun Geng, and et al. 2019. Impact of network topology on the performance of DML: Theoretical analysis and practical factors. In INFOCOM. 1729--1737.
[58]
Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, and et al. 2018. Gandiva: Introspective cluster scheduling for deep learning. In OSDI. 595--610.
[59]
Fei Xu, Fangming Liu, and Hai Jin. 2016. Heterogeneity and Interference-Aware Virtual Machine Provisioning for Predictable Performance in the Cloud. TC (2016), 2470--2483.
[60]
Zhilin Yang, Zihang Dai, Yiming Yang, and et al. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237 (2019).
[61]
Bairen Yi, Jiacheng Xia, Li Chen, and Kai Chen. 2017. Towards zero copy dataflows using rdma. In SIGCOMM Posters and Demos. 28--30.
[62]
Yang You, Zhao Zhang, Cho-Jui Hsieh, and et al. 2018. ImageNet Training in Minutes. In ICPP. 1--10.
[63]
Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J Freedman. 2017. Slaq: quality-driven scheduling for distributed machine learning. In SoCC. 390--404.
[64]
Hao Zhang, Zeyu Zheng, Shizhen Xu, and et al. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In ATC. 181--193.
[65]
Shuchang Zhou, Yuxin Wu, Zekun Ni, and et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 (2016).



      Published In

      CoNEXT '20: Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies
      November 2020
      585 pages
      ISBN:9781450379489
      DOI:10.1145/3386367
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. cloud computing
      2. datacenter networks
      3. flow scheduling
      4. iterative synchronous applications

      Qualifiers

      • Research-article

      Conference

      CoNEXT '20

      Acceptance Rates

      Overall Acceptance Rate 198 of 789 submissions, 25%


      Cited By

      • (2024) US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning. IEEE Transactions on Parallel and Distributed Systems 35(1), 123-139. https://doi.org/10.1109/TPDS.2023.3331372
      • (2024) Dynamic Flow Scheduling for DNN Training Workloads in Data Centers. IEEE Transactions on Network and Service Management 21(6), 6643-6657. https://doi.org/10.1109/TNSM.2024.3450670
      • (2024) Host-driven In-Network Aggregation on RDMA. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 1051-1060. https://doi.org/10.1109/INFOCOM52122.2024.10621230
      • (2023) sRDMA: A General and Low-Overhead Scheduler for RDMA. Proceedings of the 7th Asia-Pacific Workshop on Networking, 21-27. https://doi.org/10.1145/3600061.3600082
      • (2023) Accelerating Distributed DNN Training via Transport Layer Scheduling. IEEE Transactions on Parallel and Distributed Systems 34(5), 1650-1666. https://doi.org/10.1109/TPDS.2023.3250462
      • (2023) OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning. Parallel Computing 118, 103053. https://doi.org/10.1016/j.parco.2023.103053
      • (2022) HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning. Proceedings of the 51st International Conference on Parallel Processing, 1-11. https://doi.org/10.1145/3545008.3545024
      • (2022) Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology. IEEE/ACM Transactions on Networking 30(2), 572-585. https://doi.org/10.1109/TNET.2021.3117042
      • (2022) Congestion-aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks. IEEE Transactions on Cloud Computing, 1-17. https://doi.org/10.1109/TCC.2022.3197350
      • (2022) Mercury: A Simple Transport Layer Scheduler to Accelerate Distributed DNN Training. IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, 350-359. https://doi.org/10.1109/INFOCOM48880.2022.9796820
