
Lifting the Fog of Uncertainties: Dynamic Resource Orchestration for the Containerized Cloud

Published: 31 October 2023

Abstract

Advances in virtualization technologies have sparked a growing transition from virtual machine (VM)-based to container-based infrastructure for cloud computing. From the resource orchestration perspective, containers' lightweight and highly configurable nature not only enables more optimized strategies, but also poses greater challenges due to additional uncertainties and a larger configuration parameter search space. To this end, we propose Drone, a resource orchestration framework that adaptively configures resource parameters to improve application performance and reduce operational cost in the presence of cloud uncertainties. Built on Contextual Bandit techniques, Drone balances performance against resource cost on public clouds, and optimizes performance on private clouds where a hard resource constraint is present. We show that our algorithms achieve sub-linear growth in cumulative regret, a theoretically sound convergence guarantee, and our extensive experiments show that Drone achieves up to a 45% performance improvement and a 20% resource footprint reduction across batch processing jobs and microservice workloads.
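
To make the contextual bandit idea in the abstract concrete, the sketch below shows one very simple variant: a separate UCB1 learner per discretized context that picks a CPU limit and receives a reward trading performance against cost. This is an illustration only and is not Drone's algorithm. The arm set, the two load contexts, the reward model, and the names `ARMS`, `CONTEXTS`, `simulated_reward`, and `run` are all assumptions made up for the example.

```python
import math
import random

# Hypothetical candidate CPU limits (cores) a controller could choose between.
ARMS = [1, 2, 4]
# Hypothetical discretized context: the load level observed before each decision.
CONTEXTS = ["low", "high"]


def simulated_reward(context, cpu, rng):
    """Toy reward (made up): performance minus resource cost, plus noise."""
    demand = 1 if context == "low" else 4
    performance = min(cpu, demand) / demand  # saturates once demand is met
    cost = 0.15 * cpu                        # linear price per core
    return performance - cost + rng.gauss(0, 0.05)


def run(rounds=4000, seed=0):
    """Run per-context UCB1 and return the most-played arm for each context."""
    rng = random.Random(seed)
    counts = {(c, a): 0 for c in CONTEXTS for a in ARMS}
    means = {(c, a): 0.0 for c in CONTEXTS for a in ARMS}
    seen = {c: 0 for c in CONTEXTS}

    for _ in range(rounds):
        context = rng.choice(CONTEXTS)       # the environment reveals the context
        seen[context] += 1

        untried = [a for a in ARMS if counts[(context, a)] == 0]
        if untried:
            arm = untried[0]                 # play every arm once per context
        else:
            # UCB1 index: empirical mean plus an exploration bonus.
            arm = max(
                ARMS,
                key=lambda a: means[(context, a)]
                + math.sqrt(2 * math.log(seen[context]) / counts[(context, a)]),
            )

        reward = simulated_reward(context, arm, rng)
        counts[(context, arm)] += 1
        n = counts[(context, arm)]
        means[(context, arm)] += (reward - means[(context, arm)]) / n  # running mean

    return {c: max(ARMS, key=lambda a: counts[(c, a)]) for c in CONTEXTS}


if __name__ == "__main__":
    print(run())
```

Under this toy reward, the smallest limit has the highest expected reward at low load and the largest limit at high load, so the learner should concentrate its plays on different arms depending on the context. Sub-linear cumulative regret means that the average per-round gap to the best choice shrinks toward zero as the number of rounds grows.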


Cited By

  • (2024) An Adaptive Cloud Resource Quota Scheme Based on Dynamic Portraits and Task-Resource Matching. IEEE Transactions on Cloud Computing 12:4 (996-1010). DOI: 10.1109/TCC.2024.3410390. Online publication date: Oct-2024
  • (2024) Next-Generation Cloud Databases: Balancing Performance, Sustainability, and Resource Management. 2024 34th International Conference on Collaborative Advances in Software and COmputiNg (CASCON) (1-5). DOI: 10.1109/CASCON62161.2024.10837926. Online publication date: 11-Nov-2024
  • (2023) Lifting the Fog of Uncertainties. Proceedings of the 2023 ACM Symposium on Cloud Computing (48-64). DOI: 10.1145/3620678.3624646. Online publication date: 30-Oct-2023

Published In

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing
October 2023
624 pages
ISBN: 9798400703874
DOI: 10.1145/3620678
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '23: ACM Symposium on Cloud Computing
October 30 - November 1, 2023
Santa Cruz, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
