DOI: 10.1145/3603269.3610865
poster

Poster: Chameleon: Automatic and Adaptive Tuning for DCQCN Parameters in RDMA Networks

Published: 01 September 2023

ABSTRACT

Datacenter Quantized Congestion Notification (DCQCN) [12] is the default congestion control algorithm for Mellanox RDMA (Remote Direct Memory Access) NICs [2] in RoCEv2 (RDMA over Converged Ethernet v2) networks; these NICs are among the most widely used in leading industry companies [4, 5, 7, 9]. DCQCN works in three steps: switches mark packets with ECN (Explicit Congestion Notification) when the queue length exceeds the ECN thresholds, receivers respond to ECN-marked packets with CNPs (Congestion Notification Packets), and senders reduce their transmission rate on receiving CNPs. DCQCN has 10+ parameters across NICs and switches, covering Alpha Update, Rate Increase & Decrease, Notification Point, and ECN thresholds [3], and these parameters have a non-negligible impact on network performance. Our experiments also verify that the network performance of common AI (Artificial Intelligence) training workloads in RoCEv2 networks (e.g., all-to-all collective communication) is greatly influenced by the DCQCN parameter settings (§3). Therefore, when deploying applications in practice, the DCQCN parameters need to be carefully tested and tuned to improve network performance.
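
The mark, notify, and reduce loop described in the abstract can be sketched in a few lines of Python. This is a simplified illustration based on the DCQCN design in [12], not the poster's tuning method and not NIC firmware: the names `ecn_mark_probability`, `Sender`, `kmin_kb`, `kmax_kb`, `pmax`, and `g` are hypothetical, the rate-increase step is reduced to a single fast-recovery averaging step, and CNP pacing at the receiver is omitted.

```python
from dataclasses import dataclass


def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Switch side (assumed RED-style marking): no marking at or below kmin,
    linear ramp up to pmax at kmax, and every packet marked above kmax."""
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb > kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


@dataclass
class Sender:
    """Sender side: simplified reaction-point state (rates in Gbps)."""
    rate_gbps: float      # current sending rate
    target_gbps: float    # rate to recover toward after a cut
    alpha: float = 1.0    # running estimate of congestion severity
    g: float = 1.0 / 256  # alpha update gain (illustrative value)

    def on_cnp(self) -> None:
        # Rate decrease: remember the current rate, then cut by alpha / 2.
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1.0 - self.alpha / 2.0
        # Alpha update: congestion was signalled, so move alpha toward 1.
        self.alpha = (1.0 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        # No CNP arrived during the timer window: decay alpha and
        # recover halfway toward the target rate.
        self.alpha *= 1.0 - self.g
        self.rate_gbps = (self.rate_gbps + self.target_gbps) / 2.0
```

Under these assumptions, a sender at 100 Gbps with `alpha = 1` drops to 50 Gbps on its first CNP and recovers to 75 Gbps after one quiet period. Each knob in the sketch (`kmin_kb`, `kmax_kb`, `pmax`, `g`) stands in for one of the parameter groups the abstract lists, which is why their settings interact with workload behaviour.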

References

  1. 2020. High Precision Congestion Control. (2020). https://github.com/alibaba-edu/High-Precision-Congestion-Control
  2. 2022. DCQCN CC Algorithm. (2022). https://enterprise-support.nvidia.com/s/article/DCQCN-CC-algorithm
  3. 2023. DCQCN Parameters. (2023). https://enterprise-support.nvidia.com/s/article/dcqcn-parameters
  4. Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, et al. 2023. Empowering Azure Storage with RDMA. In NSDI.
  5. Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, et al. 2021. When Cloud Storage Meets RDMA. In NSDI.
  6. Yixiao Gao, Yuchen Yang, Tian Chen, Jiaqi Zheng, Bing Mao, and Guihai Chen. 2018. DCQCN+: Taming large-scale incast congestion in RDMA over Ethernet networks. In ICNP.
  7. Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In OSDI.
  8. Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, et al. 2019. HPCC: High precision congestion control. In SIGCOMM.
  9. Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, et al. 2022. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In ISCA.
  10. Kai Wang, Fang Dong, Dian Shen, Chengtian Zhang, Jinghui Zhang, and Junzhou Luo. 2021. Towards tunable RDMA parameter selection at runtime for datacenter applications. In CSCWD.
  11. Siyu Yan, Xiaoliang Wang, Xiaolong Zheng, Yinben Xia, Derui Liu, and Weishan Deng. 2021. ACC: Automatic ECN tuning for high-speed datacenter networks. In SIGCOMM.
  12. Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. In SIGCOMM.

Published in

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
September 2023
1217 pages
ISBN: 9798400702365
DOI: 10.1145/3603269

Copyright © 2023 Owner/Author(s)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 554 of 3,547 submissions, 16%