
BULB: Lightweight and Automated Load Balancing for Fast Datacenter Networks

Published: 13 January 2023

Abstract

Load balancing is essential for datacenter networks. However, prior solutions have significant limitations: they are either oblivious to congestion or require daunting, time-consuming parameter tuning of their heuristics to achieve good performance. Thus, we ask: is it possible to learn to balance datacenter traffic? While deep reinforcement learning (DRL) sounds like a good answer, we observe that it is too heavyweight because of its long decision-making latency. Therefore, we introduce BULB, a lightweight and automated datacenter load balancer. BULB learns link weights that guide the end-hosts in spreading traffic, which frees the central agent from quick flow-level decision-making. BULB trains a DRL agent offline to optimize link weights, then employs an imitation-learning-based approach to faithfully translate this agent’s DNN into a decision tree for online deployment. We implement a BULB prototype with a popular machine learning framework and evaluate it extensively in ns-3. The results show that BULB achieves up to 36.6%/56.4%, 19.9%/42.5%, 35.9%/54.8%, and 45.1%/67.7% better average/tail flow completion time than ECMP, CONGA, LetFlow, and Hermes, respectively. Moreover, converting the DNN into a decision tree reduces the decision latency by 175 times while incurring only 2% performance loss.
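
As a rough illustration of how learned link weights could guide end-hosts, the sketch below picks a path for each new flow with probability proportional to a path score. The abstract does not specify how weights map to path choices, so the scoring rule (product of link weights), the two-spine topology, the link names, and the weight values are all assumptions made for this example, not BULB's actual mechanism.

```python
import random
from collections import Counter

def path_weight(path, link_weights):
    """Score a path as the product of its link weights (an assumed rule)."""
    score = 1.0
    for link in path:
        score *= link_weights[link]
    return score

def pick_path(paths, link_weights, rng):
    """Choose one path with probability proportional to its score."""
    scores = [path_weight(p, link_weights) for p in paths]
    return rng.choices(paths, weights=scores, k=1)[0]

# Hypothetical leaf-spine fabric: each path is a tuple of link names.
PATHS = [
    ("leaf1-spine1", "spine1-leaf2"),
    ("leaf1-spine2", "spine2-leaf2"),
]

# Hypothetical weights, standing in for what a central agent might publish.
LINK_WEIGHTS = {
    "leaf1-spine1": 0.9,
    "spine1-leaf2": 0.9,
    "leaf1-spine2": 0.3,  # a lower weight steers traffic away from this link
    "spine2-leaf2": 0.9,
}

def simulate(num_flows=10000, seed=0):
    """Count how many flows land on each path under the weights above."""
    rng = random.Random(seed)
    return Counter(pick_path(PATHS, LINK_WEIGHTS, rng) for _ in range(num_flows))
```

In this sketch the end-host only needs the most recently published weights to place a flow, so per-flow decisions stay local while the central agent updates weights on a slower timescale.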



Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN:9781450397339
DOI:10.1145/3545008
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. DRL
  2. Data center networks
  3. Imitation learning
  4. Load balancing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2025) ALB-TP: Adaptive Load Balancing based on Traffic Prediction using GRU-Attention for Software-Defined DCNs. Journal of Network and Computer Applications 236 (104103). Online publication date: Apr-2025. DOI: 10.1016/j.jnca.2024.104103
  • (2024) Lightweight Automatic ECN Tuning Based on Deep Reinforcement Learning With Ultra-Low Overhead in Datacenter Networks. IEEE Transactions on Network and Service Management 21:6 (6398-6408). Online publication date: Dec-2024. DOI: 10.1109/TNSM.2024.3450596
  • (2024) Towards fine-grained load balancing with dynamical flowlet timeout in datacenter networks. Computer Networks (110867). Online publication date: Oct-2024. DOI: 10.1016/j.comnet.2024.110867
  • (2023) MiddleCache: Accelerating TCP based In-memory Key-value Stores using eBPF. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS) (2428-2435). Online publication date: 17-Dec-2023. DOI: 10.1109/ICPADS60453.2023.00324
  • (2023) dBFC: Destination-based Backpressure Flow Control for Incast. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS) (1853-1860). Online publication date: 17-Dec-2023. DOI: 10.1109/ICPADS60453.2023.00255
  • (2023) SemLog: A Semantics-based Approach for Anomaly Detection in Big Data System Logs. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS) (1199-1206). Online publication date: 17-Dec-2023. DOI: 10.1109/ICPADS60453.2023.00174
