research-article

Open access

TopFull: An Adaptive Top-Down Overload Control for SLO-Oriented Microservices

Authors: Jaehyeong Park, Dongsu Han

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

Pages 876 - 890

https://doi.org/10.1145/3651890.3672253

Published: 04 August 2024

Abstract

Microservices have become a de facto standard for building large-scale cloud applications. Overload control is essential for preventing microservice failures and maintaining system performance under overload. Although several approaches have been proposed, they are limited to mitigating the overload of individual microservices and do not assess interdependent microservices and APIs.

This paper presents TopFull, an adaptive overload control scheme applied at entry for microservices that leverages global observations to maximize goodput, i.e., the throughput that meets service level objectives (SLOs). TopFull performs adaptive load control on a per-API basis, exercises parallel control on each independent subset of microservices, and applies reinforcement learning (RL)-based rate controllers that adjust the admitted rates of the APIs at entry according to the severity of overload. Our experiments on various open-source benchmarks demonstrate that TopFull significantly increases goodput in overload scenarios, outperforming DAGOR by 1.82x and Breakwater by 2.26x. Furthermore, the Kubernetes autoscaler with TopFull serves up to 3.91x more requests under traffic surges and tolerates traffic spikes with up to 57% fewer resources than the standalone Kubernetes autoscaler.
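The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example API paths, and the grouping rule (APIs that share an overloaded microservice are controlled together) are assumptions made for illustration, and `adjust_rate` is a fixed-step placeholder heuristic standing in for the RL-based rate controller, which the abstract does not specify in detail.

```python
from collections import defaultdict


def goodput(latencies_ms, slo_ms, window_s):
    """Requests per second that completed within the latency SLO."""
    within_slo = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return within_slo / window_s


def independent_clusters(api_paths, overloaded):
    """Group APIs that share at least one overloaded microservice.

    Assumption: each resulting group can be rate-controlled in parallel
    because it shares no overloaded microservice with any other group.
    """
    parent = {api: api for api in api_paths}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_api_for = {}
    for api, services in api_paths.items():
        for svc in services & overloaded:
            if svc in first_api_for:
                parent[find(api)] = find(first_api_for[svc])
            else:
                first_api_for[svc] = api

    groups = defaultdict(set)
    for api, services in api_paths.items():
        if services & overloaded:  # APIs that touch no overloaded service are left alone
            groups[find(api)].add(api)
    return sorted(sorted(group) for group in groups.values())


def adjust_rate(rate, goodput_now, goodput_prev, step=0.1, floor=1.0):
    """Placeholder heuristic, NOT the paper's RL policy.

    Lowers the admitted rate when goodput dropped, raises it otherwise.
    """
    if goodput_now < goodput_prev:
        return max(floor, rate * (1 - step))
    return rate * (1 + step)


if __name__ == "__main__":
    # Hypothetical API-to-microservice paths, loosely shaped like a shop demo.
    paths = {
        "checkout": {"frontend", "cart", "payment"},
        "browse": {"frontend", "catalog"},
        "search": {"search-svc"},
        "recommend": {"catalog", "recs"},
    }
    print(independent_clusters(paths, {"catalog", "payment"}))
    print(goodput([10, 20, 300, 50], slo_ms=100, window_s=2))
```

In this sketch, `browse` and `recommend` land in one group because both traverse the overloaded `catalog` service, while `checkout` forms its own group and `search` is left uncontrolled.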

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
      2. n-tier architectures
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference

August 2024

1033 pages

ISBN:9798400706141

DOI:10.1145/3651890

Co-chairs: Aruna Seneviratne, Darryl Veitch

Program Co-chairs: Vyas Sekar, Minlan Yu

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Research Foundation of Korea
Institute of Information & Communications Technology Planning & Evaluation

Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference

Sponsor: SIGCOMM

August 4 - 8, 2024

Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%
