
Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters

Published: 05 October 2016

Abstract

Companies are increasingly deploying on-premise clusters. Security, regulatory constraints, and enhanced service quality push organizations toward these so-called private cloud environments. However, deploying private enterprise clusters requires careful planning for future compute demands and failures, since such clusters lack the public cloud's flexibility to immediately provision new nodes during demand spikes or after node failures.
To better understand the challenges and tradeoffs of operating in private settings, we perform what is, to the best of our knowledge, the first extensive characterization of on-premise clusters. Specifically, we analyze data from a large number of Nutanix clusters deployed at various companies, ranging from hardware failures to typical compute/storage requirements and workload profiles.
We show that private cloud hardware failure rates are lower, and that load/demand is more predictable, than in other settings. Finally, we demonstrate the value of these measurements by using them to build an analytical model for computing durability in private clouds, as well as a machine learning-driven approach for characterizing private cloud growth.
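The paper's analytical durability model is not reproduced on this page. As a rough illustration of the kind of model the abstract describes, the sketch below estimates the yearly probability of data loss in a replicated cluster from node count, annualized failure rate, and rebuild time. All parameter names, values, and the formula itself are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of an analytical durability estimate for a
# replicated private-cloud cluster. Every parameter (node count, AFR,
# rebuild window, replication factor) is an illustrative assumption,
# not a value or formula from the paper.

def annual_data_loss_probability(n_nodes, afr, rebuild_hours, rf=2):
    """Estimate the yearly probability that rf - 1 additional nodes
    fail during successive rebuild windows after an initial failure,
    which is when an rf-way replicated cluster can lose data."""
    hourly_rate = afr / 8760.0  # annualized failure rate -> per hour
    p_cascade = 1.0
    remaining = n_nodes - 1
    for _ in range(rf - 1):
        # Probability that at least one of the remaining nodes fails
        # within one rebuild window.
        p_window = 1.0 - (1.0 - hourly_rate * rebuild_hours) ** remaining
        p_cascade *= p_window
        remaining -= 1
    expected_initial_failures = n_nodes * afr
    return min(1.0, expected_initial_failures * p_cascade)

# Example: a 16-node cluster, 2% annualized node failure rate,
# 6-hour rebuild window, at replication factors 2 and 3.
p_rf2 = annual_data_loss_probability(16, 0.02, 6, rf=2)
p_rf3 = annual_data_loss_probability(16, 0.02, 6, rf=3)
```

Under these assumed numbers, adding a third replica shrinks the estimated loss probability by several orders of magnitude; quantifying that kind of tradeoff before provisioning hardware is precisely what an analytical durability model is for.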





Published In

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
October 2016
534 pages
ISBN:9781450345255
DOI:10.1145/2987550


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Measurements
  2. Performance
  3. Private clouds
  4. Reliability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '16: ACM Symposium on Cloud Computing
October 5-7, 2016
Santa Clara, CA, USA

Acceptance Rates

SoCC '16 paper acceptance rate: 38 of 151 submissions (25%)
Overall acceptance rate: 169 of 722 submissions (23%)



Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 3
Reflects downloads up to 07 Mar 2025


Cited By

  • (2024) "Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection." IEEE Access, 12, pp. 178951-178970. DOI: 10.1109/ACCESS.2024.3506833
  • (2023) "How Different are the Cloud Workloads? Characterizing Large-Scale Private and Public Cloud Workloads." 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 522-530. DOI: 10.1109/DSN58367.2023.00055
  • (2022) "Workload characterization and synthesis for cloud using generative stochastic processes." The Journal of Supercomputing, 78:17, pp. 18825-18855. DOI: 10.1007/s11227-022-04597-y
  • (2021) "Diagnosing Evolution of Cloud Cluster via Spatio-temporal Trace Analysis." Journal of Circuits, Systems and Computers, 31:04. DOI: 10.1142/S0218126622500694
  • (2020) "Managing Container QoS with Network and Storage Workloads over a Hyperconverged Platform." 2020 IEEE 45th Conference on Local Computer Networks (LCN), pp. 112-123. DOI: 10.1109/LCN48667.2020.9314802
  • (2020) "Heterogeneous Task Co-location in Containerized Cloud Computing Environments." 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC), pp. 79-88. DOI: 10.1109/ISORC49007.2020.00021
  • (2020) "Understanding the Workload Characteristics in Alibaba: A View from Directed Acyclic Graph Analysis." 2020 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), pp. 1-8. DOI: 10.1109/HPBDIS49115.2020.9130578
  • (2019) "IASO." Proceedings of the 2019 USENIX Annual Technical Conference, pp. 47-61. DOI: 10.5555/3358807.3358812
  • (2019) "Hotspot Mitigations for the Masses." Proceedings of the ACM Symposium on Cloud Computing, pp. 102-113. DOI: 10.1145/3357223.3362717
  • (2019) "A Comparative Study of Large-Scale Cluster Workload Traces via Multiview Analysis." 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 397-404. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00067
