DOI: 10.1145/3651890.3672231
research-article
Open access

RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering

Published: 04 August 2024

Abstract

Internet traffic bursts typically last less than a second, so conventional burst mitigation methods overlook the potential of Traffic Engineering (TE). However, our experiments indicate that a TE system with sub-second control loop latency can effectively alleviate burst-induced congestion. TE-based methods can leverage network-wide, tunnel-level information to make globally informed decisions (e.g., balancing traffic bursts among multiple paths). Our insight for reducing control loop latency is to let each router make local TE decisions, but this introduces the key challenge of minimizing performance loss compared to centralized TE systems.
In this paper, we present RedTE, a novel distributed TE system with a control loop latency of < 100ms, while achieving performance comparable to centralized TE systems. RedTE's innovation is the modeling of TE as a distributed cooperative multi-agent problem, and we design a novel multi-agent deep reinforcement learning algorithm to solve it, which enables each agent to make globally informed decisions solely based on local information. We implement real RedTE routers and deploy them on a WAN spanning six city datacenters. Evaluation reveals notable improvements compared to existing solutions: < 100ms of control loop latency, a 37.4% reduction in maximum link utilization, and a 78.9% reduction in average queue length.
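
To make the maximum link utilization (MLU) metric and the idea of balancing a burst among multiple paths concrete, here is a minimal sketch. It is not RedTE's algorithm: the topology, capacities, demand volume, and the function name max_link_utilization are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def max_link_utilization(demands, tunnels, splits, capacity):
    """Return the highest load/capacity ratio over all links.

    demands:  {(src, dst): traffic volume}
    tunnels:  {(src, dst): [path, ...]}, each path a list of link names
    splits:   {(src, dst): [ratio, ...]}, one ratio per tunnel, summing to 1
    capacity: {link name: capacity}
    """
    load = defaultdict(float)
    for pair, volume in demands.items():
        for path, ratio in zip(tunnels[pair], splits[pair]):
            for link in path:
                load[link] += volume * ratio
    return max(load[link] / capacity[link] for link in capacity)


# Hypothetical topology: one source-destination pair with two disjoint tunnels.
capacity = {"A-B": 10.0, "B-D": 10.0, "A-C": 10.0, "C-D": 10.0}
tunnels = {("A", "D"): [["A-B", "B-D"], ["A-C", "C-D"]]}
burst = {("A", "D"): 12.0}  # assumed burst larger than one tunnel's capacity

# All traffic on the first tunnel versus an even split across both.
single_path = max_link_utilization(burst, tunnels, {("A", "D"): [1.0, 0.0]}, capacity)
balanced = max_link_utilization(burst, tunnels, {("A", "D"): [0.5, 0.5]}, capacity)
print(single_path, balanced)
```

In this toy case the single-tunnel split overloads a link (utilization above 1), while the even split keeps every link below capacity. A TE system's job is to pick such split ratios quickly enough to matter for sub-second bursts.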

Cited By

  • (2024) Deep Distributional Reinforcement Learning-Based Adaptive Routing With Guaranteed Delay Bounds. IEEE/ACM Transactions on Networking 32:6 (4692-4706). DOI: 10.1109/TNET.2024.3425652. Online publication date: Dec-2024

      Published In

      ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
      August 2024
      1033 pages
      ISBN:9798400706141
      DOI:10.1145/3651890
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. traffic engineering
      2. network optimization
      3. machine learning

      Qualifiers

      • Research-article

      Funding Sources

      • the National Key R&D Program of China
      • the National Natural Science Foundation of China

      Conference

      ACM SIGCOMM '24
      ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
      August 4 - 8, 2024
      Sydney, NSW, Australia

      Acceptance Rates

      Overall Acceptance Rate 462 of 3,389 submissions, 14%

      Article Metrics

      • Downloads (last 12 months): 1,726
      • Downloads (last 6 weeks): 271
      Reflects downloads up to 17 Feb 2025.
