DOI: 10.1145/3431379.3460641
research-article

Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription

Published: 21 June 2021

Abstract

Elasticity is an essential feature of cloud computing, which allows users to dynamically add or remove resources in response to workload changes. However, building applications that truly exploit elasticity is non-trivial. Traditional applications need to be modified to efficiently utilize variable resources. This paper explores thread oversubscription, i.e., provisioning more threads than the available cores, to exploit CPU elasticity in the cloud. While maintaining sufficient concurrency allows applications to utilize additional CPUs when more are made available, it is widely believed that thread oversubscription introduces prohibitive overheads due to excessive context switches, loss of locality, and contention on shared resources.
In this paper, we conduct a comprehensive study of the overhead of thread oversubscription. We find that 1) the direct cost of context switching (i.e., 1-2 μs on modern processors) does not cause a noticeable performance slowdown in most applications; 2) oversubscription can be both constructive and destructive to the performance of CPU caches and the TLB. We identify two previously under-studied issues that are responsible for drastic slowdowns in many applications under oversubscription. First, the existing thread sleep and wakeup process in the OS kernel is inefficient in handling oversubscribed threads. Second, pervasive busy-waiting operations in program code can waste CPU cycles and starve critical threads. To address these issues, we devise two OS mechanisms, virtual blocking and busy-waiting detection, to enable efficient thread oversubscription without requiring program code changes. Experimental results show that our approaches can achieve an efficiency close to that in under-subscribed scenarios while preserving the capability to expand to many more CPUs. The performance gain is up to 77% for blocking-based and 19x for busy-waiting-based applications compared to vanilla Linux.
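
The two waiting styles the abstract distinguishes, blocking and busy-waiting, can be illustrated with a minimal user-level sketch. This is not the paper's virtual blocking or busy-waiting detection, which are OS kernel mechanisms; the function name run_oversubscribed and its parameters are hypothetical and chosen only for illustration. The sketch starts more threads than cores and holds them at a start gate, either by blocking on an event or by polling a shared flag. In CPython the global interpreter lock serializes bytecode execution, so the sketch shows the structure of the two styles and does not reproduce the paper's measurements.

```python
import os
import threading


def run_oversubscribed(factor=4, cores=None, spin=False):
    """Start factor * cores threads, hold them at a start gate, then release.

    spin=False: threads block on an Event (they sleep until woken).
    spin=True:  threads busy-wait on a shared flag (they burn CPU while waiting).
    Returns (threads_started, threads_finished).
    """
    cores = cores or os.cpu_count() or 1
    n_threads = factor * cores          # more threads than cores when factor > 1
    gate = threading.Event()            # gate used by the blocking variant
    flag = {"go": False}                # gate used by the busy-waiting variant
    finished = []
    lock = threading.Lock()

    def worker(i):
        if spin:
            while not flag["go"]:       # busy-wait: repeatedly poll the flag
                pass
        else:
            gate.wait()                 # block: sleep until the gate is set
        with lock:
            finished.append(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    flag["go"] = True                   # release the spinning threads
    gate.set()                          # wake the blocked threads
    for t in threads:
        t.join()
    return n_threads, len(finished)
```

Calling run_oversubscribed(factor=4) starts four blocking threads per detected core, while run_oversubscribed(factor=2, cores=2, spin=True) starts four spinning threads for an assumed two-core budget. In the spinning variant every waiting thread stays runnable and competes for CPU time with the thread that must set the flag, which is the kind of waste and starvation the abstract attributes to busy-waiting.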

    Published In

    HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
    June 2021
    275 pages
    ISBN:9781450382175
    DOI:10.1145/3431379

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. container
    2. elasticity
    3. over-threading
    4. performance
    5. scheduling

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program
    • National Science Foundation of China

    Conference

    HPDC '21

    Acceptance Rates

    Overall Acceptance Rate 166 of 966 submissions, 17%

    Article Metrics

    • Downloads (last 12 months): 51
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 27 Feb 2025.

    Cited By

    • (2025) HTLL: Latency-Aware Scalable Blocking Mutex. IEEE Transactions on Parallel and Distributed Systems 36:3 (471-486). DOI: 10.1109/TPDS.2025.3526859. Online publication date: 1-Mar-2025
    • (2024) Exploiting Elasticity via OS-Runtime Cooperation to Improve CPU Utilization in Multicore Systems. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (35-43). DOI: 10.1109/PDP62718.2024.00014. Online publication date: 20-Mar-2024
    • (2024) SlackVM: Packing Virtual Machines in Oversubscribed Cloud Infrastructures. 2024 IEEE International Conference on Cluster Computing (CLUSTER) (190-201). DOI: 10.1109/CLUSTER59578.2024.00024. Online publication date: 24-Sep-2024
    • (2024) A neural network framework for optimizing parallel computing in cloud servers. Journal of Systems Architecture: the EUROMICRO Journal 150:C. DOI: 10.1016/j.sysarc.2024.103131. Online publication date: 1-May-2024
    • (2023) An Economy-Oriented GPU Virtualization With Dynamic and Adaptive Oversubscription. IEEE Transactions on Computers 72:5 (1371-1383). DOI: 10.1109/TC.2022.3199998. Online publication date: 1-May-2023
    • (2023) Precise control of page cache for containers. Frontiers of Computer Science: Selected Publications from Chinese Universities 18:2. DOI: 10.1007/s11704-022-2455-0. Online publication date: 13-Sep-2023
    • (2022) Performance Implications of Thread Count on OS Level Factors in Multithreaded Applications. 2022 6th International Conference On Computing, Communication, Control And Automation (ICCUBEA) (1-5). DOI: 10.1109/ICCUBEA54992.2022.10011099. Online publication date: 26-Aug-2022
    • (2022) Reducing Cache Miss Rate Using Thread Oversubscription to Accelerate an MPI-OpenMP-Based 2-D Hopmoc Method. Computational Science and Its Applications – ICCSA 2022 (337-353). DOI: 10.1007/978-3-031-10522-7_24. Online publication date: 4-Jul-2022
    • (2021) Research and Practice of Container System. 2021 International Symposium on Theoretical Aspects of Software Engineering (TASE) (13-14). DOI: 10.1109/TASE52547.2021.00013. Online publication date: Aug-2021
