A Decentralized SDN Architecture for the WAN

Research article, open access
Published: 04 August 2024
DOI: 10.1145/3651890.3672257

Abstract

Motivated by our experiences operating a global WAN, we argue that SDN's reliance on infrastructure external to the data plane has substantially complicated the challenge of maintaining high availability. We propose a new decentralized SDN (dSDN) architecture in which SDN control logic instead runs within routers, eliminating the control plane's reliance on external infrastructure and restoring fate-sharing between control and data planes. We present dSDN as a simpler approach to realizing the benefits of SDN in the WAN. Despite its much simpler design, we show that dSDN is practical from an implementation viewpoint, and outperforms centralized SDN in terms of routing convergence and SLO impact.

Cited By

  • (2024) The Case for Validating Inputs in Software-Defined WANs. Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, 246-254. DOI: 10.1145/3696348.3696874. Online publication date: 18-Nov-2024

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
August 2024
1033 pages
ISBN: 9798400706141
DOI: 10.1145/3651890
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. wide-area networks
  2. software-defined networking
  3. traffic engineering

Qualifiers

  • Research-article

Conference

ACM SIGCOMM '24
ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4 - 8, 2024
Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
