Research article · DOI: 10.1145/3357223.3362717

Hotspot Mitigations for the Masses

Published: 20 November 2019

ABSTRACT

In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most systems focus on schedulers that fit very large infrastructures, which leads to workload-dependent optimisations and thereby limits their portability. However, while the number of massive public clouds is very small, there are countless private clouds running very different workloads. In that context, we consider it essential to look for schedulers that overcome the workload diversity observed in private clouds, so as to benefit as many use cases as possible.

The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.
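To make concrete what "observing and mitigating resource hotspots" involves, the sketch below shows a simple threshold-based detector that flags overloaded hosts and picks a VM and a target host for live migration. This is a minimal illustration under assumed thresholds and heuristics, not the algorithm used by ADS; all names here (`Host`, `VM`, `CPU_HOTSPOT_THRESHOLD`, `pick_migration`) are hypothetical.

```python
# Illustrative threshold-based hotspot detection and migration-candidate
# selection. NOT the ADS algorithm from the paper: thresholds, data
# structures, and the selection heuristic are assumptions for illustration.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CPU_HOTSPOT_THRESHOLD = 0.85  # assumed fraction of host CPU capacity

@dataclass
class VM:
    name: str
    cpu_usage: float  # fraction of host CPU consumed by this VM

@dataclass
class Host:
    name: str
    vms: List[VM] = field(default_factory=list)

    @property
    def cpu_usage(self) -> float:
        return sum(vm.cpu_usage for vm in self.vms)

def find_hotspots(hosts: List[Host]) -> List[Host]:
    """Return hosts whose aggregate CPU usage exceeds the threshold."""
    return [h for h in hosts if h.cpu_usage > CPU_HOTSPOT_THRESHOLD]

def pick_migration(hot: Host, hosts: List[Host]) -> Optional[Tuple[VM, Host]]:
    """Pick the smallest VM whose departure clears the hotspot and a target
    host that can absorb it without itself becoming a hotspot."""
    overload = hot.cpu_usage - CPU_HOTSPOT_THRESHOLD
    candidates = sorted((vm for vm in hot.vms if vm.cpu_usage >= overload),
                        key=lambda vm: vm.cpu_usage)
    for vm in candidates:
        for target in hosts:
            if target is hot:
                continue
            if target.cpu_usage + vm.cpu_usage <= CPU_HOTSPOT_THRESHOLD:
                return vm, target
    return None  # no single migration resolves the hotspot

if __name__ == "__main__":
    hosts = [
        Host("node-1", [VM("vm-a", 0.5), VM("vm-b", 0.4)]),  # 0.9: hotspot
        Host("node-2", [VM("vm-c", 0.3)]),
    ]
    for hot in find_hotspots(hosts):
        plan = pick_migration(hot, hosts)
        if plan:
            vm, target = plan
            print(f"live-migrate {vm.name} from {hot.name} to {target.name}")
```

In this toy run, node-1 exceeds the assumed 85% threshold, and the smallest VM that clears the overload (vm-b) is migrated to node-2, which stays below the threshold afterwards.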


Published in

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing, November 2019, 503 pages.
ISBN: 9781450369732
DOI: 10.1145/3357223
Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States
