DOI: 10.1145/3651890.3672221
research-article

Canal Mesh: A Cloud-Scale Sidecar-Free Multi-Tenant Service Mesh Architecture

Published: 04 August 2024

Abstract

In recent years, service mesh frameworks have become widely used for building microservice-based applications. A key component of these frameworks is a proxy in each Kubernetes (K8s) pod, called a sidecar, which handles inter-pod traffic. Our empirical measurements reveal that per-pod sidecars cause several problems: intrusion into the user pod, excessive resource consumption, significant overhead in managing many sidecars, and performance degradation from routing traffic through the sidecar.
In this paper, we introduce Canal Mesh, a cloud-scale sidecar-free multi-tenant service mesh architecture. Canal decouples service mesh functions from the user cluster and deploys a centralized mesh gateway in the public cloud to handle these functions, thus reducing user intrusion and orchestration overhead. Service consolidation and multi-tenancy also reduce the infrastructure costs of the service mesh. To address the issues that arise from cloud-based deployment, such as service availability, tenant isolation, noisy neighbors, service elasticity, and additional infrastructure costs, we leverage techniques including hierarchical failure recovery, shuffle sharding, rapid intervention, precise scaling, cloud infrastructure reuse, and resource aggregation. Our evaluation shows that Canal Mesh's performance, resource consumption, and control plane overhead are significantly better than those of Istio and Ambient. We also share experiences from years of deploying Istio and Canal in production.
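Of the techniques named above, shuffle sharding is the easiest to illustrate in isolation. The sketch below shows the general idea only and is not taken from the paper: each tenant is mapped to a small pseudo-random subset of a shared gateway pool, so that a misbehaving tenant can affect only the gateways in its own subset and two tenants rarely share their entire subset. The function names (`shuffle_shard`, `shared_gateways`), the pool size, and the shard size are all hypothetical choices made for this example.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, num_gateways: int, shard_size: int) -> tuple:
    """Deterministically pick `shard_size` distinct gateway indices for a tenant.

    Hypothetical helper for illustration; not Canal Mesh's actual assignment logic.
    """
    if not 0 < shard_size <= num_gateways:
        raise ValueError("shard_size must be between 1 and num_gateways")
    # Derive a stable seed from the tenant ID so the mapping is repeatable.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Sample without replacement from the gateway pool.
    return tuple(sorted(rng.sample(range(num_gateways), shard_size)))


def shared_gateways(shard_a: tuple, shard_b: tuple) -> int:
    """Count how many gateways two tenants' shards have in common."""
    return len(set(shard_a) & set(shard_b))


if __name__ == "__main__":
    # Assumed pool of 16 gateways with 2 gateways per tenant.
    for tenant in ("tenant-a", "tenant-b", "tenant-c"):
        print(tenant, shuffle_shard(tenant, num_gateways=16, shard_size=2))
```

With a pool of 16 gateways and shards of size 2 there are 120 possible shards, so under uniform random assignment the chance that two given tenants land on exactly the same pair is 1 in 120, compared with 1 in 8 if the pool were split into 8 fixed pairs.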

Cited By

  • (2024) SURE: Secure Unikernels Make Serverless Computing Rapid and Efficient. Proceedings of the 2024 ACM Symposium on Cloud Computing, 668-688. DOI: 10.1145/3698038.3698558. Online publication date: 20-Nov-2024

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
August 2024
1033 pages
ISBN: 9798400706141
DOI: 10.1145/3651890
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. microservice
  2. service mesh
  3. sidecar
  4. public cloud
  5. multi-tenancy
  6. service consolidation
  7. centralized mesh gateway

Qualifiers

  • Research-article

Funding Sources

  • Key R&D Program of Zhejiang Province

Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4-8, 2024
Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%

Article Metrics

  • Downloads (Last 12 months): 1,755
  • Downloads (Last 6 weeks): 73
Reflects downloads up to 20 Feb 2025
