ABSTRACT
In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most work focuses on schedulers designed for very large infrastructures, which leads to workload-dependent optimisations and thereby limits their portability. However, while massive public clouds are very few, there are countless private clouds running very different workloads. In that context, we consider it essential to look for schedulers that overcome the workload diversity observed in private clouds, so as to benefit as many use cases as possible.
The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we have made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed, and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.
Index Terms
- Hotspot Mitigations for the Masses