DOI: 10.1145/3582016.3582037
Research article

In-Network Aggregation with Transport Transparency for Distributed Training

Published: 25 March 2023

ABSTRACT

Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions replace the transport layer with custom network stacks. Such INA-oriented stacks cannot take advantage of state-of-the-art, high-performance transport implementations, and they add complexity to system development and operation.

We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs the aggregation operations but preserves the data-transmission connections, and the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce thus combines the benefits of INA (linear scalability, traffic reduction, and freed-up bandwidth) with those of RoCE (high throughput, low latency, and low CPU overhead). For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce based on NetReduce that uses intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch, compare it with existing programmable-switch-based solutions, and justify the FPGA-based design choice. We evaluate NetReduce's performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce inter-operates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., by 70% for AlexNet), reduces CPU overhead (e.g., only one core for transmission), and is cost-effective (e.g., 12.3-57.9% more acceleration for only 2.40% more capital expense and 0.68% more power consumption).
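To make the transport-transparency idea concrete, the minimal sketch below (plain Python, purely illustrative; it is not the paper's FPGA data path or RoCE stack, and the names AggSlot and InaSwitch are hypothetical) models a switch that sums per-chunk gradients from N workers and returns the aggregated result on each worker's own connection, so end-to-end transport state such as sequencing and acknowledgments is left untouched.

```python
# Minimal, illustrative model of transport-transparent in-network aggregation.
# Assumptions (not from the paper): one job, fixed-size float chunks, reliable
# in-order delivery per connection, and hypothetical names (AggSlot, InaSwitch).

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AggSlot:
    """Aggregation state for one gradient-chunk index on the switch."""
    expected: int                                   # number of workers in the job
    partial: List[float] = field(default_factory=list)
    seen: set = field(default_factory=set)          # workers that have contributed

    def add(self, worker_id: int, chunk: List[float]) -> bool:
        """Accumulate one worker's chunk; return True once all workers contributed."""
        if worker_id in self.seen:                  # ignore duplicates (e.g., retransmits)
            return False
        self.seen.add(worker_id)
        if not self.partial:
            self.partial = list(chunk)
        else:
            for i, v in enumerate(chunk):
                self.partial[i] += v
        return len(self.seen) == self.expected


class InaSwitch:
    """Sums gradient chunks across workers, but replies on each worker's own
    connection so the end-host transport (RoCE in the paper) stays intact."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.slots: Dict[int, AggSlot] = {}
        # One outbound queue per worker connection; a real data path would also
        # preserve per-connection headers and sequence numbers here.
        self.tx_queues: List[List[Tuple[int, List[float]]]] = [
            [] for _ in range(num_workers)
        ]

    def on_gradient(self, worker_id: int, chunk_idx: int, chunk: List[float]) -> None:
        slot = self.slots.setdefault(chunk_idx, AggSlot(expected=self.num_workers))
        if slot.add(worker_id, chunk):
            # All workers have contributed: send the sum back once per connection.
            for w in range(self.num_workers):
                self.tx_queues[w].append((chunk_idx, list(slot.partial)))
            del self.slots[chunk_idx]


if __name__ == "__main__":
    switch = InaSwitch(num_workers=3)
    gradients = {0: [1.0, 2.0], 1: [10.0, 20.0], 2: [100.0, 200.0]}
    for wid, grad in gradients.items():
        switch.on_gradient(worker_id=wid, chunk_idx=0, chunk=grad)
    # Every worker receives the same aggregated chunk on its own connection.
    print(switch.tx_queues[0])  # [(0, [111.0, 222.0])]
```

In this model each worker sends every gradient chunk once and receives the aggregated chunk once, roughly half the per-host traffic of host-based ring all-reduce (which both sends and receives about twice the gradient size per iteration); this is the traffic-reduction benefit the abstract refers to.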


Published in

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
March 2023, 820 pages
ISBN: 9781450399180
DOI: 10.1145/3582016
Copyright © 2023 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

