DOI: 10.1145/3605573.3605646
research-article

On Optimizing Traffic Scheduling for Multi-replica Containerized Microservices

Published: 13 September 2023

ABSTRACT

Containerized deployment of microservices has become prevalent, as it provides flexible deployment and elastic resource configuration. For high concurrency and fault tolerance, multiple container replicas are often deployed for each microservice component, but this may induce heavy cross-machine traffic and degrade the performance of microservice applications. Traffic localization tries to place containers that exchange heavy traffic on the same machine to reduce cross-machine traffic. However, containers with heavy mutual traffic still commonly end up on different machines, especially under multi-replica deployment, because a single physical machine has insufficient resources. To this end, we develop OptTraffic, a network-aware scheduling system that realizes optimized traffic scheduling for containerized microservices. OptTraffic estimates the traffic between each pair of containers in a lightweight manner by combining a simple mathematical calculation with coarse-grained monitoring. It then applies an efficient traffic allocation algorithm and leverages dynamic scheduling with multiple optimizations to minimize cross-machine traffic without sacrificing resource usage balance. Experiments show that, under multi-replica deployment, OptTraffic saves up to 47% of the network bandwidth while reducing P99 latency by 28%-45% compared to Kubernetes and existing traffic localization designs for real-world microservice applications.
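
To make the objective concrete, the following is a minimal sketch of how cross-machine traffic could be measured for a given placement. It is not OptTraffic's algorithm: the function name `cross_machine_traffic`, the container and machine names, and the traffic values are all hypothetical and chosen only for illustration. It assumes per-pair traffic estimates are already available.

```python
from typing import Dict, Tuple


def cross_machine_traffic(traffic: Dict[Tuple[str, str], float],
                          placement: Dict[str, str]) -> float:
    """Sum the traffic of every container pair placed on different machines."""
    return sum(rate for (src, dst), rate in traffic.items()
               if placement[src] != placement[dst])


# Hypothetical per-pair traffic estimates (units arbitrary, e.g. MB/s).
traffic = {
    ("frontend-0", "cart-0"): 40.0,
    ("frontend-0", "cart-1"): 10.0,
    ("frontend-1", "cart-1"): 30.0,
}

# Placement A: replicas grouped by microservice, so every pair crosses machines.
spread = {"frontend-0": "m1", "frontend-1": "m1",
          "cart-0": "m2", "cart-1": "m2"}

# Placement B: heavily communicating replicas share a machine.
localized = {"frontend-0": "m1", "cart-0": "m1",
             "frontend-1": "m2", "cart-1": "m2"}

print(cross_machine_traffic(traffic, spread))     # 80.0
print(cross_machine_traffic(traffic, localized))  # 10.0
```

In this toy case, both placements put two containers on each machine, so the resource balance is the same, yet the second placement leaves only the lightest pair communicating across machines.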


    • Published in

      ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
      August 2023
      858 pages
      ISBN:9798400708435
      DOI:10.1145/3605573

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 91 of 313 submissions, 29%