
Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters

Published: 05 October 2016

Abstract

Companies are increasingly deploying on-premise clusters. Security, regulatory constraints, and enhanced service quality push organizations toward these so-called private cloud environments. However, deploying private enterprise clusters requires careful planning for future compute demands and failures, since such clusters lack the public cloud's flexibility to immediately provision new nodes during demand spikes or after node failures.
To better understand the challenges and tradeoffs of operating in private settings, we perform what is, to the best of our knowledge, the first extensive characterization of on-premise clusters. Specifically, we analyze data from a large number of Nutanix clusters deployed at various companies, ranging from hardware failures to typical compute/storage requirements and workload profiles.
We show that private cloud hardware failure rates are lower, and that load/demand is more predictable, than in other settings. Finally, we demonstrate the value of these measurements by using them to build an analytical model for computing durability in private clouds, as well as a machine learning-driven approach for characterizing private cloud growth.
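The paper's analytical durability model is not reproduced on this page. As a rough illustration of the kind of model the abstract describes, the sketch below estimates the yearly probability of data loss in a replicated cluster from node count, annualized failure rate, and rebuild time. All parameter names, values, and the formula itself are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of an analytical durability estimate for a
# replicated private-cloud cluster. Every parameter (node count, AFR,
# rebuild window, replication factor) is an illustrative assumption,
# not a value or formula from the paper.

def annual_data_loss_probability(n_nodes, afr, rebuild_hours, rf=2):
    """Estimate the yearly probability that rf - 1 additional nodes
    fail during successive rebuild windows after an initial failure,
    which is when an rf-way replicated cluster can lose data."""
    hourly_rate = afr / 8760.0  # annualized failure rate -> per hour
    p_cascade = 1.0
    remaining = n_nodes - 1
    for _ in range(rf - 1):
        # Probability that at least one of the remaining nodes fails
        # within one rebuild window.
        p_window = 1.0 - (1.0 - hourly_rate * rebuild_hours) ** remaining
        p_cascade *= p_window
        remaining -= 1
    expected_initial_failures = n_nodes * afr
    return min(1.0, expected_initial_failures * p_cascade)

# Example: a 16-node cluster, 2% annualized node failure rate,
# 6-hour rebuild window, at replication factors 2 and 3.
p_rf2 = annual_data_loss_probability(16, 0.02, 6, rf=2)
p_rf3 = annual_data_loss_probability(16, 0.02, 6, rf=3)
```

Under these assumed numbers, adding a third replica shrinks the estimated loss probability by several orders of magnitude; quantifying that kind of tradeoff before provisioning hardware is precisely what an analytical durability model is for.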





Published In

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
October 2016
534 pages
ISBN:9781450345255
DOI:10.1145/2987550


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Measurements
  2. Performance
  3. Private clouds
  4. Reliability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '16: ACM Symposium on Cloud Computing
October 5-7, 2016
Santa Clara, CA, USA

Acceptance Rates

SoCC '16 paper acceptance rate: 38 of 151 submissions (25%)
Overall acceptance rate: 169 of 722 submissions (23%)



Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 3
Reflects downloads up to 07 Mar 2025


Cited By

  • (2024) "Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection." IEEE Access, 12, pp. 178951-178970. DOI: 10.1109/ACCESS.2024.3506833
  • (2023) "How Different are the Cloud Workloads? Characterizing Large-Scale Private and Public Cloud Workloads." 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 522-530. DOI: 10.1109/DSN58367.2023.00055
  • (2022) "Workload characterization and synthesis for cloud using generative stochastic processes." The Journal of Supercomputing, 78:17, pp. 18825-18855. DOI: 10.1007/s11227-022-04597-y
  • (2021) "Diagnosing Evolution of Cloud Cluster via Spatio-temporal Trace Analysis." Journal of Circuits, Systems and Computers, 31:04. DOI: 10.1142/S0218126622500694
  • (2020) "Managing Container QoS with Network and Storage Workloads over a Hyperconverged Platform." 2020 IEEE 45th Conference on Local Computer Networks (LCN), pp. 112-123. DOI: 10.1109/LCN48667.2020.9314802
  • (2020) "Heterogeneous Task Co-location in Containerized Cloud Computing Environments." 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC), pp. 79-88. DOI: 10.1109/ISORC49007.2020.00021
  • (2020) "Understanding the Workload Characteristics in Alibaba: A View from Directed Acyclic Graph Analysis." 2020 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), pp. 1-8. DOI: 10.1109/HPBDIS49115.2020.9130578
  • (2019) "IASO." Proceedings of the 2019 USENIX Annual Technical Conference, pp. 47-61. DOI: 10.5555/3358807.3358812
  • (2019) "Hotspot Mitigations for the Masses." Proceedings of the ACM Symposium on Cloud Computing, pp. 102-113. DOI: 10.1145/3357223.3362717
  • (2019) "A Comparative Study of Large-Scale Cluster Workload Traces via Multiview Analysis." 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 397-404. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00067
