ABSTRACT
Elasticity is an essential feature of cloud computing, which allows users to dynamically add or remove resources in response to workload changes. However, building applications that truly exploit elasticity is non-trivial. Traditional applications need to be modified to efficiently utilize variable resources. This paper explores thread oversubscription, i.e., provisioning more threads than the available cores, to exploit CPU elasticity in the cloud. While maintaining sufficient concurrency allows applications to utilize additional CPUs when more are made available, it is widely believed that thread oversubscription introduces prohibitive overheads due to excessive context switches, loss of locality, and contention on shared resources.
In this paper, we conduct a comprehensive study of the overhead of thread oversubscription. We find that 1) the direct cost of context switching (i.e., 1-2 μs on modern processors) does not cause a noticeable slowdown in most applications; 2) oversubscription can be both constructive and destructive to the performance of CPU caches and TLBs. We identify two previously under-studied issues that are responsible for drastic slowdowns in many applications under oversubscription. First, the existing thread sleep and wakeup process in the OS kernel is inefficient in handling oversubscribed threads. Second, pervasive busy-waiting operations in program code can waste CPU time and starve critical threads. To address these issues, we devise two OS mechanisms, virtual blocking and busy-waiting detection, that enable efficient thread oversubscription without requiring program code changes. Experimental results show that our approaches achieve an efficiency close to that of under-subscribed scenarios while preserving the capability to expand to many more CPUs. The performance gain is up to 77% for blocking-based and 19x for busy-waiting-based applications compared to vanilla Linux.
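The two waiting styles contrasted above can be illustrated with a minimal user-level sketch. It is not the paper's virtual blocking or busy-waiting detection mechanism, which operate inside the OS kernel. The function name `run_waiters`, the thread count of 8, and the 50 ms delay are hypothetical choices made for illustration, and because of CPython's global interpreter lock the reported CPU time is only indicative, not a benchmark.

```python
import threading
import time


def run_waiters(num_threads, use_blocking, delay_s=0.05):
    """Start num_threads waiters, release them after delay_s seconds.

    Returns (number of waiters that completed, CPU seconds consumed).
    All names and default values here are illustrative assumptions.
    """
    go_event = threading.Event()   # blocking wait: the thread sleeps until set
    go_flag = [False]              # busy-wait: the thread polls this flag
    counter_lock = threading.Lock()
    finished = [0]

    def waiter():
        if use_blocking:
            go_event.wait()        # yields the CPU while waiting
        else:
            while not go_flag[0]:  # burns CPU time while waiting
                pass
        with counter_lock:
            finished[0] += 1

    cpu_start = time.process_time()
    threads = [threading.Thread(target=waiter) for _ in range(num_threads)]
    for t in threads:
        t.start()
    time.sleep(delay_s)            # the "critical" work the waiters depend on
    go_flag[0] = True
    go_event.set()
    for t in threads:
        t.join()
    return finished[0], time.process_time() - cpu_start


if __name__ == "__main__":
    for blocking in (True, False):
        done, cpu = run_waiters(8, use_blocking=blocking)
        style = "blocking" if blocking else "busy-waiting"
        print(f"{style}: {done} waiters finished, {cpu:.3f} s CPU time")
```

Both variants release the same number of waiters. The busy-waiting variant typically consumes CPU time for the whole waiting period, whereas the blocking variant leaves the processor free for other threads.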
Index Terms
- Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription