research-article
Open access

TopFull: An Adaptive Top-Down Overload Control for SLO-Oriented Microservices

Published: 04 August 2024

Abstract

Microservices have become a de facto standard for building large-scale cloud applications. Overload control is essential for preventing microservice failures and maintaining system performance under overload. Although several approaches have been proposed, they are limited to mitigating the overload of individual microservices and do not assess interdependent microservices and APIs.
This paper presents TopFull, an adaptive overload control at entry for microservices that leverages global observations to maximize goodput, i.e., throughput that meets service level objectives (SLOs). TopFull performs adaptive load control on a per-API basis, exercises parallel control on each independent subset of microservices, and applies reinforcement learning (RL)-based rate controllers that adjust the admitted rates of the APIs at entry according to the severity of overload. Our experiments on various open-source benchmarks demonstrate that TopFull significantly increases goodput in overload scenarios, outperforming DAGOR by 1.82x and Breakwater by 2.26x. Furthermore, the Kubernetes autoscaler with TopFull serves up to 3.91x more requests under traffic surge and tolerates traffic spikes with up to 57% fewer resources than the standalone Kubernetes autoscaler.
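
To make the terms in the abstract concrete, the following is a minimal sketch of per-API admission control at entry and of goodput measurement. It is not the paper's implementation: the names ApiRateLimiter, goodput, and heuristic_action are hypothetical, and the simple threshold rule stands in for the RL-based rate controller, whose state, action, and reward design the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class ApiRateLimiter:
    """Admission limit for one API at the entry point (hypothetical helper)."""

    rate_limit: float  # admitted requests per second

    def adjust(self, action: float, floor: float = 1.0) -> float:
        """Scale the limit by (1 + action); a negative action sheds load."""
        self.rate_limit = max(floor, self.rate_limit * (1.0 + action))
        return self.rate_limit


def goodput(latencies_ms: list[float], slo_ms: float, window_s: float) -> float:
    """Requests per second that completed within the latency SLO."""
    met = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return met / window_s


def heuristic_action(latencies_ms: list[float], slo_ms: float, step: float = 0.1) -> float:
    """Assumed stand-in for a learned policy: cut harder as overload worsens."""
    if not latencies_ms:
        return step
    violations = sum(1 for latency in latencies_ms if latency > slo_ms) / len(latencies_ms)
    # No violations: probe upward by `step`. Otherwise: back off in proportion
    # to the violation ratio, capped at a 50% cut per control interval.
    return step if violations == 0 else -min(0.5, violations)


# One control interval for a single API, with made-up measurements.
limiter = ApiRateLimiter(rate_limit=100.0)
observed = [50.0, 80.0, 120.0, 300.0]  # latencies in ms over a 2-second window
print(goodput(observed, slo_ms=100.0, window_s=2.0))
print(limiter.adjust(heuristic_action(observed, slo_ms=100.0)))
```

In this sketch each API gets its own limiter, so a controller can lower the admitted rate of an API that drives an overloaded microservice without throttling unrelated APIs.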

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
August 2024
1033 pages
ISBN:9798400706141
DOI:10.1145/3651890
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. microservices
  2. overload control
  3. quality of service
  4. resources optimization
  5. applied machine learning
  6. cloud computing

Qualifiers

  • Research-article

Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4 - 8, 2024
Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
