Maximizing VMs' IO Performance on Overcommitted CPUs with Fairness

Authors:
Tong Xing, The University of Edinburgh and Stevens Institute of Technology, Hoboken, NJ, USA (ORCID 0000-0003-2099-6418)
Cong Xiong, The University of Edinburgh (ORCID 0009-0005-7221-3229)
Chuan Ye, Huawei Cloud (ORCID 0009-0009-4836-0252)
Qi Wei, Huawei Cloud (ORCID 0009-0009-6849-9198)
Javier Picorel, Huawei Cloud (ORCID 0009-0009-6984-1303)
Antonio Barbalace, The University of Edinburgh and Huawei Technology, Munich, Bavaria, Germany (ORCID 0000-0003-1641-0779)

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023, Pages 93–108
https://doi.org/10.1145/3620678.3624649

Published: 31 October 2023

ABSTRACT

To improve resource utilization and reduce costs, many Cloud providers overcommit virtual machines (VMs). While effective, this strategy can significantly degrade a VM's IO performance when a virtual CPU (vCPU) is preempted by another vCPU in the same runqueue of the VM scheduler, i.e., on the same physical CPU (pCPU). In addition, the VM is less responsive while its vCPU is inactive, because the vCPU needs an extra scheduling timeslice before it can react to an IO event. Although these problems have been studied in academia and industry, no previous solution has been deployed in production. One reason is that some solutions require modifying the guest VM, which conflicts with industry requirements.

We propose Anubis, a new IO-aware VM scheduler for Linux KVM, the most popular VMM in today's Clouds, that requires no guest VM modifications. Anubis shortens the time IO events stay pending through lightweight monitoring of IO events, including interrupt delivery and KVM exits. For a vCPU performing IO, Anubis provides an accurate boost that is active only while the vCPU has IO activity. While maximizing IO performance, Anubis still guarantees fairness among VMs: vCPUs of the same VM that have no IO activity voluntarily yield computing resources to counterbalance the unfairness created by the boosted vCPU. Overall, Anubis is a practical solution that provides close-to-non-overcommit performance for IO workloads in overcommitted VM scenarios.
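
The fairness idea in the abstract (boost the vCPU with IO activity, and let non-IO vCPUs of the same VM yield to pay for it) can be illustrated with a toy accounting model. This is a minimal sketch under stated assumptions, not the Anubis implementation: the `VCpu` class, the `has_io` flag, the `allocate` function, and the per-period microsecond budgets are all hypothetical names chosen for illustration, and the real scheduler works inside Linux KVM rather than on a static budget.

```python
from dataclasses import dataclass


@dataclass
class VCpu:
    name: str
    has_io: bool = False   # hypothetical flag: True while the vCPU has IO activity
    runtime_us: int = 0    # CPU time granted for the current accounting period


def allocate(vcpus, share_us, boost_us):
    """Give every vCPU of one VM its share, boost IO vCPUs, charge non-IO siblings."""
    for v in vcpus:
        v.runtime_us = share_us
    io = [v for v in vcpus if v.has_io]
    idle = [v for v in vcpus if not v.has_io]
    if not io or not idle:
        return  # nothing to boost, or no sibling able to pay for the boost
    # The boost is capped by what the non-IO siblings can give up.
    per_io = min(boost_us * len(io), share_us * len(idle)) // len(io)
    total = per_io * len(io)
    for v in io:
        v.runtime_us += per_io
    # Spread the cost evenly; the first `extra` siblings yield one more microsecond.
    per_idle, extra = divmod(total, len(idle))
    for i, v in enumerate(idle):
        v.runtime_us -= per_idle + (1 if i < extra else 0)


vm = [VCpu("vcpu0", has_io=True), VCpu("vcpu1"), VCpu("vcpu2")]
allocate(vm, share_us=1000, boost_us=301)
print([(v.name, v.runtime_us) for v in vm])
print(sum(v.runtime_us for v in vm))  # VM total stays at 3 * 1000 = 3000
```

In this model the boosted vCPU receives 1301 µs while its two siblings drop to 849 µs and 850 µs, so the VM as a whole consumes exactly its original share and other VMs on the host are unaffected.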

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling
      2. Software infrastructure
        Virtual machines
    2. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Published in

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing
October 2023
624 pages
ISBN: 9798400703874
DOI: 10.1145/3620678

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
IO performance
KVM
Linux
Overcommit
compute resources
fair scheduling
low-latency
virtualization
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 169 of 722 submissions, 23%