RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering

Authors:

Yi Wang

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

Pages 71 - 85

https://doi.org/10.1145/3651890.3672231

Published: 04 August 2024

Abstract

Internet traffic bursts usually happen within a second, so conventional burst mitigation methods overlook the potential of Traffic Engineering (TE). However, our experiments indicate that a TE system with sub-second control loop latency can effectively alleviate burst-induced congestion. TE-based methods can leverage network-wide, tunnel-level information to make globally informed decisions (e.g., balancing traffic bursts among multiple paths). Our insight for reducing control loop latency is to let each router make local TE decisions, but this introduces the key challenge of minimizing the performance loss relative to centralized TE systems.

In this paper, we present RedTE, a novel distributed TE system with a control loop latency of < 100ms, while achieving performance comparable to centralized TE systems. RedTE's innovation is the modeling of TE as a distributed cooperative multi-agent problem, and we design a novel multi-agent deep reinforcement learning algorithm to solve it, which enables each agent to make globally informed decisions solely based on local information. We implement real RedTE routers and deploy them on a WAN spanning six city datacenters. Evaluation reveals notable improvements compared to existing solutions: < 100ms of control loop latency, a 37.4% reduction in maximum link utilization, and a 78.9% reduction in average queue length.
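
To make the maximum link utilization (MLU) metric concrete, the following minimal sketch computes MLU for a set of tunnels and per-tunnel split ratios, and shows how shifting traffic away from a link hit by a burst lowers it. This is an illustration only and not RedTE's algorithm: the four-node topology, the capacities, the demands, and the function name max_link_utilization are all hypothetical and chosen for readability.

```python
from collections import defaultdict


def max_link_utilization(demands, paths, split_ratios, capacity):
    """Return (MLU, per-link utilization) for the given split ratios.

    demands:      {(src, dst): traffic volume}
    paths:        {(src, dst): [path, ...]}, each path a list of nodes
    split_ratios: {(src, dst): [ratio, ...]}, one ratio per path, summing to 1
    capacity:     {(u, v): link capacity}
    """
    load = defaultdict(float)
    for pair, demand in demands.items():
        for path, ratio in zip(paths[pair], split_ratios[pair]):
            # A path is a node list; consecutive nodes form directed links.
            for link in zip(path, path[1:]):
                load[link] += demand * ratio
    utilization = {link: load[link] / capacity[link] for link in load}
    return max(utilization.values()), utilization


# Hypothetical topology: A reaches D via B or via C; C also sends to D directly.
capacity = {("A", "B"): 10.0, ("B", "D"): 10.0, ("A", "C"): 10.0, ("C", "D"): 10.0}
paths = {
    ("A", "D"): [["A", "B", "D"], ["A", "C", "D"]],
    ("C", "D"): [["C", "D"]],
}
# Assumed snapshot in which the C->D demand has just burst to 6 units.
demands = {("A", "D"): 8.0, ("C", "D"): 6.0}

even_split = {("A", "D"): [0.5, 0.5], ("C", "D"): [1.0]}
shifted_split = {("A", "D"): [0.75, 0.25], ("C", "D"): [1.0]}

mlu_even, _ = max_link_utilization(demands, paths, even_split, capacity)
mlu_shifted, _ = max_link_utilization(demands, paths, shifted_split, capacity)

print(f"even split MLU:    {mlu_even:.2f}")     # 1.00, link C->D is saturated
print(f"shifted split MLU: {mlu_shifted:.2f}")  # 0.80, burst absorbed by A-B-D
```

In this toy case, the router at A lowers MLU from 1.0 to 0.8 simply by changing its own split ratios, which is the kind of decision the abstract describes being made locally at each router.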

Cited By

J. Liu, D. Li, and Y. Xu. 2024. Deep Distributional Reinforcement Learning-Based Adaptive Routing With Guaranteed Delay Bounds. IEEE/ACM Transactions on Networking 32, 6 (Dec. 2024), 4692--4706. https://doi.org/10.1109/TNET.2024.3425652

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Reinforcement learning
2. Networks
  1. Network algorithms
    1. Control path algorithms
      1. Traffic engineering algorithms

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

August 2024

1033 pages

ISBN:9798400706141

DOI:10.1145/3651890

Co-chairs: Aruna Seneviratne, Darryl Veitch

Program Co-chairs: Vyas Sekar, Minlan Yu

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the National Key R&D Program of China
the National Natural Science Foundation of China

Conference

ACM SIGCOMM '24

Sponsor:

SIGCOMM

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference

August 4 - 8, 2024

Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
