BULB: Lightweight and Automated Load Balancing for Fast Datacenter Networks

Authors:

Heng Qi

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing

Article No.: 60, Pages 1 - 11

https://doi.org/10.1145/3545008.3545021

Published: 13 January 2023

Abstract

Load balancing is essential for datacenter networks. However, prior solutions have significant limitations: they are either oblivious to congestion or require daunting, time-consuming parameter tuning of their heuristics to achieve good performance. Thus, we ask: is it possible to learn to balance datacenter traffic? While deep reinforcement learning (DRL) sounds like a good answer, we observe that it is too heavyweight because of its long decision-making latency. Therefore, we introduce BULB, a lightweight and automated datacenter load balancer. BULB learns link weights that guide end-hosts in spreading traffic, which frees the central agent from making quick flow-level decisions. BULB trains a DRL agent offline to optimize link weights, then uses an imitation-learning-based approach to faithfully translate the agent’s DNN into a decision tree for online deployment. We implement a BULB prototype with a popular machine learning framework and evaluate it extensively in ns-3. The results show that BULB achieves up to 36.6%/56.4%, 19.9%/42.5%, 35.9%/54.8%, and 45.1%/67.7% better average/tail flow completion time than ECMP, CONGA, LetFlow, and Hermes, respectively. Moreover, converting the DNN into a decision tree reduces the decision latency by 175 times while incurring only a 2% performance loss.
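To make the idea of end-hosts spreading traffic according to learned link weights more concrete, the following is a minimal sketch, not the BULB implementation. The abstract does not say how link weights are combined into a path choice, so everything here is an assumption for illustration: the function names `path_weight` and `pick_path`, the rule that a path's weight is the minimum of its link weights, the toy two-spine topology, and the weight values themselves.

```python
import random
from collections import Counter

def path_weight(path, link_weights):
    """Assumed aggregation rule: a path is only as good as its lowest-weight link."""
    return min(link_weights[link] for link in path)

def pick_path(paths, link_weights, rng=random):
    """Choose one path with probability proportional to its aggregated weight."""
    weights = [path_weight(p, link_weights) for p in paths]
    return rng.choices(paths, weights=weights, k=1)[0]

# Hypothetical link weights, as if pushed to the end-host by a central agent.
link_weights = {
    "leaf1-spine1": 0.75, "spine1-leaf2": 0.75,
    "leaf1-spine2": 0.25, "spine2-leaf2": 0.5,
}

# Two candidate paths between leaf1 and leaf2, one per spine (hypothetical).
paths = [
    ("leaf1-spine1", "spine1-leaf2"),
    ("leaf1-spine2", "spine2-leaf2"),
]

# Simulate many independent path choices with a fixed seed and measure
# the share of choices that land on the first path.
rng = random.Random(0)
counts = Counter(pick_path(paths, link_weights, rng) for _ in range(10_000))
share = counts[paths[0]] / 10_000
print(f"share of choices on the spine1 path: {share:.3f}")
```

Under these assumed weights the two paths aggregate to 0.75 and 0.25, so roughly three quarters of the choices should use the first path. The point of the sketch is only that the per-choice work at the host is a cheap weighted draw, while the weights themselves can be updated on a slower timescale.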

Index Terms

BULB: Lightweight and Automated Load Balancing for Fast Datacenter Networks
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Reinforcement learning
2. Networks
  1. Network algorithms
    1. Control path algorithms
      1. Traffic engineering algorithms
  2. Network types
    1. Data center networks

Information

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing

August 2022

976 pages

ISBN:9781450397339

DOI:10.1145/3545008

Copyright © 2022 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP '22

ICPP '22: 51st International Conference on Parallel Processing

August 29 - September 1, 2022

Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Cited By

Liu Y, Meng Q, Chen K, and Shen Z. 2025. ALB-TP: Adaptive Load Balancing based on Traffic Prediction using GRU-Attention for Software-Defined DCNs. Journal of Network and Computer Applications 236 (2025), 104103. Online publication date: Apr-2025. https://doi.org/10.1016/j.jnca.2024.104103
Hu J, Zhou Z, and Zhang J. 2024. Lightweight Automatic ECN Tuning Based on Deep Reinforcement Learning With Ultra-Low Overhead in Datacenter Networks. IEEE Transactions on Network and Service Management 21, 6 (2024), 6398–6408. Online publication date: Dec-2024. https://doi.org/10.1109/TNSM.2024.3450596
Hu J, Li R, Liu Y, and Wang J. 2024. Towards fine-grained load balancing with dynamical flowlet timeout in datacenter networks. Computer Networks (2024), 110867. Online publication date: Oct-2024. https://doi.org/10.1016/j.comnet.2024.110867
Pang Y, Chen S, Li W, Liu H, Li Y, He X, Zhang S, Guan Z, Suo L, and Liu Y. 2023. MiddleCache: Accelerating TCP based In-memory Key-value Stores using eBPF. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2428–2435. Online publication date: 17-Dec-2023. https://doi.org/10.1109/ICPADS60453.2023.00324
Guan Z, Li W, He X, Zhang S, and Li K. 2023. dBFC: Destination-based Backpressure Flow Control for Incast. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 1853–1860. Online publication date: 17-Dec-2023. https://doi.org/10.1109/ICPADS60453.2023.00255
Tan X, Han N, Lu S, Chen W, and Wang D. 2023. SemLog: A Semantics-based Approach for Anomaly Detection in Big Data System Logs. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 1199–1206. Online publication date: 17-Dec-2023. https://doi.org/10.1109/ICPADS60453.2023.00174
