Research article · DOI: 10.1145/3357223.3362717

Hotspot Mitigations for the Masses

Published: 20 November 2019

ABSTRACT

In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most systems focus on schedulers that fit very large infrastructures, which leads to workload-dependent optimisations and thereby limits their portability. However, while the number of massive public clouds is very small, there are countless private clouds running very different workloads. In that context, we consider it essential to look for schedulers that overcome the workload diversity observed in private clouds, so as to benefit as many use cases as possible.

The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.
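To make concrete what "observing and mitigating resource hotspots" involves, the sketch below shows a simple threshold-based detector that flags overloaded hosts and picks a VM and a target host for live migration. This is a minimal illustration under assumed thresholds and heuristics, not the algorithm used by ADS; all names here (`Host`, `VM`, `CPU_HOTSPOT_THRESHOLD`, `pick_migration`) are hypothetical.

```python
# Illustrative threshold-based hotspot detection and migration-candidate
# selection. NOT the ADS algorithm from the paper: thresholds, data
# structures, and the selection heuristic are assumptions for illustration.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CPU_HOTSPOT_THRESHOLD = 0.85  # assumed fraction of host CPU capacity

@dataclass
class VM:
    name: str
    cpu_usage: float  # fraction of host CPU consumed by this VM

@dataclass
class Host:
    name: str
    vms: List[VM] = field(default_factory=list)

    @property
    def cpu_usage(self) -> float:
        return sum(vm.cpu_usage for vm in self.vms)

def find_hotspots(hosts: List[Host]) -> List[Host]:
    """Return hosts whose aggregate CPU usage exceeds the threshold."""
    return [h for h in hosts if h.cpu_usage > CPU_HOTSPOT_THRESHOLD]

def pick_migration(hot: Host, hosts: List[Host]) -> Optional[Tuple[VM, Host]]:
    """Pick the smallest VM whose departure clears the hotspot and a target
    host that can absorb it without itself becoming a hotspot."""
    overload = hot.cpu_usage - CPU_HOTSPOT_THRESHOLD
    candidates = sorted((vm for vm in hot.vms if vm.cpu_usage >= overload),
                        key=lambda vm: vm.cpu_usage)
    for vm in candidates:
        for target in hosts:
            if target is hot:
                continue
            if target.cpu_usage + vm.cpu_usage <= CPU_HOTSPOT_THRESHOLD:
                return vm, target
    return None  # no single migration resolves the hotspot

if __name__ == "__main__":
    hosts = [
        Host("node-1", [VM("vm-a", 0.5), VM("vm-b", 0.4)]),  # 0.9: hotspot
        Host("node-2", [VM("vm-c", 0.3)]),
    ]
    for hot in find_hotspots(hosts):
        plan = pick_migration(hot, hosts)
        if plan:
            vm, target = plan
            print(f"live-migrate {vm.name} from {hot.name} to {target.name}")
```

In this toy run, node-1 exceeds the assumed 85% threshold, and the smallest VM that clears the overload (vm-b) is migrated to node-2, which stays below the threshold afterwards.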


Published in

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing, November 2019, 503 pages.
ISBN: 9781450369732
DOI: 10.1145/3357223
Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States
