ABSTRACT
Traffic load balancing is a long-standing networking challenge. The dynamism of traffic and the growing number of different workloads that flow through the network exacerbate the problem. This work presents QCMP, a reinforcement-learning-based load balancing solution. QCMP runs within the data plane, adjusting its policy dynamically and responding quickly to changes in traffic. We implement QCMP in P4 on a switch ASIC and on BMv2 in a simulation environment. Our results show that QCMP requires negligible resources, runs at line rate, and adapts quickly to changes in traffic patterns.
Index Terms
- QCMP: Load Balancing via In-Network Reinforcement Learning