DOI: 10.1145/3387514.3405863

Fault Tolerant Service Function Chaining

Published: 30 July 2020

Abstract

Network traffic typically traverses a sequence of middleboxes forming a service function chain, or simply a chain. Tolerating failures when they occur along chains is imperative to the availability and reliability of enterprise applications. Making a chain fault-tolerant is challenging since, in the event of failures, the state of faulty middleboxes must be correctly and quickly recovered while providing high throughput and low latency.
In this paper, we introduce FTC, a system design and protocol for fault-tolerant service function chaining. FTC provides strong consistency with up to f middlebox failures for chains of length f + 1 or longer without requiring dedicated replica nodes. In FTC, state updates caused by packet processing at a middlebox are collected, piggybacked onto the packet, and sent along the chain to be replicated. Our evaluation shows that, compared with the state of the art [51], FTC improves throughput by 2–3.5× for chains of two to five middleboxes.
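The piggybacking idea in the abstract can be illustrated with a minimal Python sketch. This is not FTC's implementation: the `Packet` and `Middlebox` classes, the per-payload counter used as middlebox state, and the `hops_left` field are all hypothetical names chosen for illustration. The sketch assumes that each middlebox's update is replicated at the next f middleboxes downstream, and it leaves out how updates from the last middleboxes in the chain are replicated, as well as failure detection and recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: str
    # Piggybacked updates: (origin index, key, value, hops_left).
    piggyback: list = field(default_factory=list)


class Middlebox:
    """Toy middlebox that counts packets per payload (hypothetical state)."""

    def __init__(self, index, f):
        self.index = index
        self.f = f            # number of downstream replicas (assumption)
        self.state = {}       # this middlebox's own state
        self.replicas = {}    # origin index -> copy of upstream state

    def process(self, pkt):
        still_travelling = []
        # 1. Apply updates piggybacked by upstream middleboxes.
        for origin, key, value, hops_left in pkt.piggyback:
            self.replicas.setdefault(origin, {})[key] = value
            if hops_left > 1:
                still_travelling.append((origin, key, value, hops_left - 1))
        # 2. Process the packet, which updates local state.
        count = self.state.get(pkt.payload, 0) + 1
        self.state[pkt.payload] = count
        # 3. Piggyback the local update for the next f middleboxes.
        if self.f > 0:
            still_travelling.append((self.index, pkt.payload, count, self.f))
        pkt.piggyback = still_travelling
        return pkt


def run_chain(chain, pkt):
    """Send one packet through every middlebox in order."""
    for middlebox in chain:
        pkt = middlebox.process(pkt)
    return pkt


if __name__ == "__main__":
    chain = [Middlebox(i, f=1) for i in range(3)]
    for _ in range(2):
        run_chain(chain, Packet("flow-a"))
    # Middlebox 1 holds a copy of middlebox 0's state.
    print(chain[1].replicas[0] == chain[0].state)
```

In this sketch no dedicated replica node exists: each middlebox doubles as a replica for its upstream neighbours, so a copy of a failed middlebox's state could be read from a successor's `replicas` map.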

Supplementary Material

MP4 File (3387514.3405863.mp4)
The uploaded file is the 20-minute presentation of "Fault Tolerant Service Function Chaining" at SIGCOMM 2020.

References

[1]
2017. NFV Whitepaper. Technical Report. European Telecommunications Standards Institute. https://portal.etsi.org/NFV/NFV_White_Paper.pdf
[2]
2019. mazu-nat.click. https://github.com/kohler/click/blob/master/conf/mazu-nat.click.
[3]
2020. Tuning Failover Cluster Network Thresholds. https://bit.ly/2NC7dGk. [Online].
[4]
Ali Abedi and Tim Brecht. 2017. Conducting Repeatable Experiments in Highly Variable Cloud Computing Environments. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE '17). ACM, New York, NY, USA, 287--292. https://doi.org/10.1145/3030207.3030229
[5]
David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). ACM, New York, NY, USA, 1--14. https://doi.org/10.1145/1629575.1629577
[6]
P Ayuso. 2006. Netfilter's connection tracking system. ;login 31, 3 (2006).
[7]
Pankaj Berde, Matteo Gerola, Jonathan Hart, Yuta Higuchi, Masayoshi Kobayashi, Toshio Koide, Bob Lantz, Brian O'Connor, Pavlin Radoslavov, William Snow, and Guru Parulkar. 2014. ONOS: Towards an Open, Distributed SDN OS. In Proceedings of the Third Workshop on Hot Topics in Software Defined Networking (HotSDN '14). ACM, New York, NY, USA, 1--6. https://doi.org/10.1145/2620728.2620744
[8]
Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. 1993. The Primary-backup Approach. In Distributed Systems (2nd Ed.), Sape Mullender (Ed.). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 199--216. http://dl.acm.org/citation.cfm?id=302430.302438
[9]
B. Carpenter and S. Brim. 2002. Middleboxes: Taxonomy and Issues. RFC 3234. RFC Editor. 1-27 pages. http://www.rfc-editor.org/rfc/rfc3234.txt
[10]
Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43, 2 (March 1996), 225--267. https://doi.org/10.1145/226643.226647
[11]
Adrian Cockcroft. 2012. A Closer Look At The Christmas Eve Outage. http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html.
[12]
Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield. 2008. Remus: High Availability via Asynchronous Virtual Machine Replication. In 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI 08). USENIX Association, San Francisco, CA.
[13]
Dave Dice, Yossi Lev, Virendra J. Marathe, Mark Moir, Dan Nussbaum, and Marek Olszewski. 2010. Simplifying Concurrent Algorithms by Exploiting Hardware Transactional Memory. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '10). Association for Computing Machinery, New York, NY, USA, 325--334. https://doi.org/10.1145/1810479.1810537
[14]
Dave Dice, Ori Shalev, and Nir Shavit. 2006. Transactional Locking II. In Proceedings of the 20th International Conference on Distributed Computing (DISC'06). Springer-Verlag, Berlin, Heidelberg, 194--208. https://doi.org/10.1007/11864219_14
[15]
Mihai Dobrescu, Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, and Sylvia Ratnasamy. 2009. RouteBricks: Exploiting Parallelism to Scale Software Routers. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). ACM, New York, NY, USA, 15--28. https://doi.org/10.1145/1629575.1629578
[16]
YaoZu Dong, Wei Ye, YunHong Jiang, Ian Pratt, ShiQing Ma, Jian Li, and HaiBing Guan. 2013. COLO: COarse-grained LOck-stepping Virtual Machines for Nonstop Service. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 3, 16 pages. https://doi.org/10.1145/2523616.2523630
[17]
Paul Emmerich, Sebastian Gallenmüller, Daniel Raumer, Florian Wohlfart, and Georg Carle. 2015. MoonGen: A Scriptable High-Speed Packet Generator. In Proceedings of the 2015 Internet Measurement Conference (IMC '15). ACM, New York, NY, USA, 275--287. https://doi.org/10.1145/2815675.2815692
[18]
Robert Escriva, Bernard Wong, and Emin Gün Sirer. 2012. HyperDex: A Distributed, Searchable Key-value Store. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '12). ACM, New York, NY, USA, 25--36. https://doi.org/10.1145/2342356.2342360
[19]
Colin J Fidge. 1987. Timestamps in message-passing systems that preserve the partial ordering. Australian National University. Department of Computer Science.
[20]
N. Freed. 2000. Behavior of and Requirements for Internet Firewalls. RFC 2979. RFC Editor. 1-7 pages. http://www.rfc-editor.org/rfc/rfc2979.txt
[21]
Scott Lystig Fritchie. 2010. Chain Replication in Theory and in Practice. In Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang '10). ACM, New York, NY, USA, 33--44. https://doi.org/10.1145/1863509.1863515
[22]
Rohan Gandhi, Y. Charlie Hu, and Ming Zhang. 2016. Yoda: A Highly Available Layer-7 Load Balancer. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys '16). ACM, New York, NY, USA, Article 21, 16 pages. https://doi.org/10.1145/2901318.2901352
[23]
Aaron Gember-Jacobson, Raajay Viswanathan, Chaithan Prakash, Robert Grandl, Junaid Khalid, Sourav Das, and Aditya Akella. 2014. OpenNF: Enabling Innovation in Network Function Control. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 163--174. https://doi.org/10.1145/2619239.2626313
[24]
Y. Gu, M. Shore, and S. Sivakumar. 2013. A Framework and Problem Statement for Flow-associated Middlebox State Migration. https://tools.ietf.org/html/draft-gu-statemigration-framework-03.
[25]
T. Hain. 2000. Architectural Implications of NAT. RFC 2993. RFC Editor. 1-29 pages. http://www.rfc-editor.org/rfc/rfc2993.txt
[26]
Sangjin Han, Keon Jang, KyoungSoo Park, and Sue Moon. 2010. PacketShader: A GPU-accelerated Software Router. SIGCOMM Comput. Commun. Rev. 40, 4 (Aug. 2010), 195--206. https://doi.org/10.1145/1851275.1851207
[27]
Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim, and Ion Stoica. 2018. NetChain: Scale-Free Sub-RTT Coordination. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 35--49. https://www.usenix.org/conference/nsdi18/presentation/jin
[28]
D. Joseph and I. Stoica. 2008. Modeling middleboxes. IEEE Network 22, 5 (September 2008), 20--25. https://doi.org/10.1109/MNET.2008.4626228
[29]
Murad Kablan, Azzam Alsudais, Eric Keller, and Franck Le. 2017. Stateless Network Functions: Breaking the Tight Coupling of State and Processing. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 97--112. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/kablan
[30]
J. M. Kang, H. Bannazadeh, and A. Leon-Garcia. 2013. SAVI testbed: Control and management of converged virtual ICT resources. In 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013). 664--667.
[31]
Naga Katta, Haoyu Zhang, Michael Freedman, and Jennifer Rexford. 2015. Ravana: Controller Fault-tolerance in Software-defined Networking. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15). ACM, New York, NY, USA, Article 4, 12 pages. https://doi.org/10.1145/2774993.2774996
[32]
Junaid Khalid and Aditya Akella. 2019. Correctness and Performance for Stateful Chained Network Functions. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 501--516. https://www.usenix.org/conference/nsdi19/presentation/khalid
[33]
Junaid Khalid, Aaron Gember-Jacobson, Roney Michael, Anubhavnidhi Abhashkumar, and Aditya Akella. 2016. Paving the Way for NFV: Simplifying Middlebox Modifications Using StateAlyzr. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, Santa Clara, CA, 239--253. https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/khalid
[34]
Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The Click Modular Router. ACM Trans. Comput. Syst. 18, 3 (Aug. 2000), 263--297. https://doi.org/10.1145/354871.354874
[35]
Sameer G Kulkarni, Guyue Liu, KK Ramakrishnan, Mayutan Arumaithurai, Timothy Wood, and Xiaoming Fu. 2018. REINFORCE: Achieving Efficient Failure Resiliency for Network Function Virtualization based Services. In 15th USENIX International Conference on emerging Networking Experiments and Technologies (CoNEXT '18). USENIX Association, 35--49.
[36]
Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (Dec. 2001), 18--25.
[37]
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Operating Systems Design and Implementation (OSDI). 583--598.
[38]
NiciraNetworks. 2019. OpenvSwitch: An open virtual switch. http://openvswitch.org.
[39]
NiciraNetworks. 2019. The published ONOS Docker images. https://hub.docker.com/r/onosproject/onos/.
[40]
Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). 305--319.
[41]
Aurojit Panda, Wenting Zheng, Xiaohe Hu, Arvind Krishnamurthy, and Scott Shenker. 2017. SCL: Simplifying Distributed SDN Control Planes. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 329--345. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/panda-aurojit-scl
[42]
Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. 2012. Improving Network Connection Locality on Multicore Systems. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12). ACM, New York, NY, USA, 337--350. https://doi.org/10.1145/2168836.2168870
[43]
Amar Phanishayee, David G. Andersen, Himabindu Pucha, Anna Povzner, and Wendy Belluomini. 2012. Flex-KV: Enabling High-performance and Flexible KV Systems. In Proceedings of the 2012 Workshop on Management of Big Data Systems (MBDS '12). ACM, New York, NY, USA, 19--24. https://doi.org/10.1145/2378356.2378361
[44]
Rahul Potharaju and Navendu Jain. 2013. Demystifying the Dark Side of the Middle: A Field Study of Middlebox Failures in Datacenters. In Proceedings of the 2013 Conference on Internet Measurement Conference (IMC '13). ACM, New York, NY, USA, 9--22. https://doi.org/10.1145/2504730.2504737
[45]
Zafar Ayyub Qazi, Cheng-Chun Tu, Luis Chiang, Rui Miao, Vyas Sekar, and Minlan Yu. 2013. SIMPLE-fying Middlebox Policy Enforcement Using SDN. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (SIGCOMM '13). ACM, New York, NY, USA, 27--38. https://doi.org/10.1145/2486001.2486022
[46]
Paul Quinn and Thomas Nadeau. 2015. Problem Statement for Service Function Chaining. Internet-Draft. IETF. https://tools.ietf.org/html/rfc7498
[47]
Shriram Rajagopalan, Dan Williams, and Hani Jamjoom. 2013. Pico Replication: A High Availability Framework for Middleboxes. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 1, 15 pages. https://doi.org/10.1145/2523616.2523635
[48]
Daniel J Scales, Mike Nelson, and Ganesh Venkitachalam. 2010. The design and evaluation of a practical system for fault-tolerant virtual machines. Technical Report VMWare-RT-2010-001. VMWare.
[49]
Fred B. Schneider. 1990. Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299--319. https://doi.org/10.1145/98163.98167
[50]
Vyas Sekar, Norbert Egi, Sylvia Ratnasamy, Michael K Reiter, and Guangyu Shi. 2012. Design and implementation of a consolidated middlebox architecture. In NSDI 12. 323--336.
[51]
Justine Sherry, Peter Xiang Gao, Soumya Basu, Aurojit Panda, Arvind Krishnamurthy, Christian Maciocco, Maziar Manesh, Joao Martins, Sylvia Ratnasamy, Luigi Rizzo, and Scott Shenker. 2015. Rollback-Recovery for Middleboxes. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 227--240. https://doi.org/10.1145/2785956.2787501
[52]
Robin Sommer, Matthias Vallentin, Lorenzo De Carli, and Vern Paxson. 2014. HILTI: An Abstract Execution Environment for Deep, Stateful Network Traffic Analysis. In Proceedings of the 2014 Conference on Internet Measurement Conference (IMC '14). ACM, New York, NY, USA, 461--474. https://doi.org/10.1145/2663716.2663735
[53]
P. Srisuresh and K. Egevang. 2001. Traditional IP Network Address Translator (Traditional NAT). RFC 3022. RFC Editor. 1-16 pages. http://www.rfc-editor.org/rfc/rfc3022.txt
[54]
Rob Strom and Shaula Yemini. 1985. Optimistic Recovery in Distributed Systems. ACM Trans. Comput. Syst. 3, 3 (Aug. 1985), 204--226. https://doi.org/10.1145/3959.3962
[55]
The AWS Team. 2012. Summary of the October 22, 2012 AWS Service Event in the US-East Region. https://aws.amazon.com/message/680342/.
[56]
The Google Apps Team. 2012. Data Center Outages Generate Big Losses. http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/appsstatus/ir/plibxfjh8whr44h.pdf.
[57]
Daniel Turull, Peter Sjödin, and Robert Olsson. 2016. Pktgen: Measuring performance on high speed networks. Computer Communications 82 (2016), 39--48. https://doi.org/10.1016/j.comcom.2016.03.003
[58]
Robbert van Renesse and Fred B. Schneider. 2004. Chain Replication for Supporting High Throughput and Availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, Berkeley, CA, USA, 7--7. http://dl.acm.org/citation.cfm?id=1251254.1251261
[59]
W. Liu, H. Li, O. Huang, M. Boucadair, N. Leymann, Z. Cao, and J. Hu. 2014. Service function chaining use-cases. https://tools.ietf.org/html/draft-liu-sfc-use-cases-01.

Published In

SIGCOMM '20: Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication
July 2020
814 pages
ISBN:9781450379557
DOI:10.1145/3387514
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Middlebox Reliability
  2. Service Function Chain Fault Tolerance

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGCOMM '20

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%

Article Metrics

  • Downloads (last 12 months): 68
  • Downloads (last 6 weeks): 8
Reflects downloads up to 16 Feb 2025.

Cited By

  • (2025) Towards cost optimization in security-aware service function chaining and embedding over multi-vendor edge networks. Computer Networks 257 (111002). https://doi.org/10.1016/j.comnet.2024.111002. Online publication date: Feb-2025
  • (2024) Chain Segment Protection for Dependence-aware Service Function Chain in NFV. 2024 7th World Conference on Computing and Communication Technologies (WCCCT) (80-85). https://doi.org/10.1109/WCCCT60665.2024.10541465. Online publication date: 12-Apr-2024
  • (2024) SafeDRL: Dynamic Microservice Provisioning With Reliability and Latency Guarantees in Edge Environments. IEEE Transactions on Computers 73:1 (235-248). https://doi.org/10.1109/TC.2023.3329194. Online publication date: Jan-2024
  • (2024) StateOS: Enabling Versatile Network Function Virtualization in Edge Clouds. NOMS 2024-2024 IEEE Network Operations and Management Symposium (1-9). https://doi.org/10.1109/NOMS59830.2024.10575285. Online publication date: 6-May-2024
  • (2024) Dependable Virtual Network Services: An Architecture for Fault- and Intrusion-tolerant SFCs. 2024 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN) (1-6). https://doi.org/10.1109/NFV-SDN61811.2024.10807480. Online publication date: 5-Nov-2024
  • (2024) Towards resources optimization in deploying service function chains with shared protection. Computer Networks (110494). https://doi.org/10.1016/j.comnet.2024.110494. Online publication date: May-2024
  • (2023) Availability-aware Provision of Service Function Chains in Mobile Edge Computing. ACM Transactions on Sensor Networks 19:3 (1-28). https://doi.org/10.1145/3565483. Online publication date: 1-Mar-2023
  • (2023) A Dynamic Service Identity-Based Security Policy Consistency Checking Mechanism in SDN. 2023 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics) (59-64). https://doi.org/10.1109/iThings-GreenCom-CPSCom-SmartData-Cybermatics60724.2023.00034. Online publication date: 17-Dec-2023
  • (2023) Service Function Chaining and Embedding With Heterogeneous Faults Tolerance in Edge Networks. IEEE Transactions on Network and Service Management 20:3 (2157-2171). https://doi.org/10.1109/TNSM.2022.3220667. Online publication date: Sep-2023
  • (2023) On the Game-Theoretic Analysis of Dynamic VNF Service Chaining in Edge-Cloud EONs. Journal of Lightwave Technology 41:10 (2940-2952). https://doi.org/10.1109/JLT.2023.3239501. Online publication date: 15-May-2023
