
Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription

Published: 21 June 2021

ABSTRACT

Elasticity is an essential feature of cloud computing, which allows users to dynamically add or remove resources in response to workload changes. However, building applications that truly exploit elasticity is non-trivial. Traditional applications need to be modified to efficiently utilize variable resources. This paper explores thread oversubscription, i.e., provisioning more threads than the available cores, to exploit CPU elasticity in the cloud. While maintaining sufficient concurrency allows applications to utilize additional CPUs when more are made available, it is widely believed that thread oversubscription introduces prohibitive overheads due to excessive context switches, loss of locality, and contention on shared resources.
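As a concrete illustration (a minimal sketch, not code from the paper; the 4x oversubscription factor and the dummy workload are arbitrary assumptions), the following pthread program oversubscribes the CPU by creating several times more worker threads than there are online cores:

```c
/*
 * Minimal sketch, not code from the paper: create 4x as many worker
 * threads as there are online cores (deliberate oversubscription).
 * The 4x factor and the dummy workload are arbitrary placeholders.
 * Build with: cc -O2 -pthread oversub.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    volatile double x = 0.0;
    for (long i = 0; i < 10 * 1000 * 1000; i++)  /* placeholder CPU-bound work */
        x += i * 0.5;
    printf("worker %ld done\n", id);
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    long nthreads = 4 * ncores;                  /* oversubscribe by 4x */
    pthread_t *tids = malloc((size_t)nthreads * sizeof(*tids));

    printf("%ld cores online, creating %ld threads\n", ncores, nthreads);
    for (long i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return 0;
}
```

With this surplus of runnable threads, the same binary can spread onto additional cores as soon as the scheduler is given them; the question the paper studies is what that surplus costs when the extra cores are not there.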

In this paper, we conduct a comprehensive study of the overhead of thread oversubscription. We find that 1) the direct cost of context switching (i.e., 1-2 μs on modern processors) does not cause a noticeable performance slowdown for most applications; 2) oversubscription can be both constructive and destructive to the performance of CPU caches and the TLB. We identify two previously under-studied issues that are responsible for drastic slowdowns in many applications under oversubscription. First, the existing thread sleep and wakeup process in the OS kernel is inefficient at handling oversubscribed threads. Second, pervasive busy-waiting operations in program code can waste CPU and starve critical threads. To this end, we devise two OS mechanisms, virtual blocking and busy-waiting detection, to enable efficient thread oversubscription without requiring program code changes. Experimental results show that our approaches achieve efficiency close to that of under-subscribed scenarios while preserving the capability to expand to many more CPUs. The performance gain is up to 77% for blocking-based and 19x for busy-waiting-based applications compared to vanilla Linux.
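To make the busy-waiting issue concrete, here is a purely illustrative C sketch (it is not the paper's virtual blocking or busy-waiting detection mechanism): the spinning consumer burns its entire timeslice while it waits, which under oversubscription can delay the very producer it is waiting for, whereas the blocking variant puts the thread to sleep and frees the core.

```c
/*
 * Purely illustrative sketch, not the paper's mechanism: the same
 * producer/consumer handoff written with busy-waiting and with blocking.
 * The spinning consumer keeps a core busy until it is preempted; the
 * blocking consumer sleeps in the kernel and frees the core for other
 * oversubscribed threads. Build with: cc -O2 -pthread handoff.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int ready = 0;                      /* set once by the producer */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    atomic_store(&ready, 1);
    pthread_cond_signal(&cv);                     /* wake a blocked consumer */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void consumer_busy_wait(void)
{
    while (!atomic_load(&ready))                  /* ad hoc spin: burns CPU, never yields */
        ;
}

static void consumer_blocking(void)
{
    pthread_mutex_lock(&m);
    while (!atomic_load(&ready))
        pthread_cond_wait(&cv, &m);               /* sleeps; the core is released */
    pthread_mutex_unlock(&m);
}

int main(void)
{
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    consumer_busy_wait();                         /* spins until the producer gets to run */
    consumer_blocking();                          /* returns immediately; flag already set */
    pthread_join(p, NULL);
    puts("handoff complete");
    return 0;
}
```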


Published in

HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
June 2021
275 pages
ISBN: 9781450382175
DOI: 10.1145/3431379

      Copyright © 2021 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

Published: 21 June 2021


      Qualifiers

      • research-article

      Acceptance Rates

Overall Acceptance Rate: 166 of 966 submissions, 17%
