Abstract
For years, capacity planning professionals have known or suspected that various characteristics of computer usage have non-normal distributions. At the same time, much of traditional workload modeling and forecasting rests on mathematical techniques that assume some form of normality in the underlying distributions. When the actual distribution differs from the assumed one, the resulting capacity models are of lower quality, with possibly erroneous forecasts and confidence intervals much wider than expected. This paper analyzes the distribution of daily resource usage on three storage clusters over 478 days. For each day we consider the distribution of resource usage by customer accounts for five resources: storage used, storage transactions executed, internal network transfer, egress transfer, and inter-data-center transfer. This gives 7170 sample distributions in total (3 clusters × 478 days × 5 resources). All distributions were highly imbalanced, and most distribution samples have tails heavier than log-normal, exponential, or normal distributions. These findings spell significant problems for most models assuming normality. Mathematically, the Central Limit Theorem in its classical form does not apply to power-law distributions with infinite variance, so the ‘averaging’ effect cannot be counted on to help with modeling using the traditional approach. Operationally, the very high volatility found means that the ‘capacity buffers’ need to be large, leading to wasted capacity; other, administrative means need to be applied to reduce that waste. Overall, the distributions of resource usage in cloud storage are so far from normal, even after the usual transformations, that the traditional approach to forecasting and capacity planning needs to be reconsidered. The distributions of log-returns of the time series describing resource usage are much more heavy-tailed than the corresponding distributions for stock indexes. Since no financial professional would use linear regression for stock market analysis and forecasting, it stands to reason that capacity planning should also move toward tools that account for heavy-tailed distributions.
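The quantities named in the abstract (log-returns, tail heaviness, and how imbalanced usage is across accounts) can be illustrated with a minimal Python sketch. It is not the paper's method or data: the samples are synthetic, the Pareto exponent of 1.5 and the normal parameters are arbitrary assumptions, the function names are hypothetical, and the Hill estimator is used here only as one common way to gauge a tail exponent.

```python
import math
import random


def log_returns(series):
    """Log-returns r_t = ln(x_t / x_{t-1}) of a strictly positive series."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]


def hill_tail_index(sample, k):
    """Hill estimate of the tail exponent from the k largest values.

    Assumes positive values and k < len(sample); a stand-in technique,
    not taken from the paper.
    """
    top = sorted(sample, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest value
    return k / sum(math.log(x / threshold) for x in top[:k])


def top_share(sample, fraction=0.01):
    """Share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(sample, reverse=True)
    n_top = max(1, int(len(ordered) * fraction))
    return sum(ordered[:n_top]) / sum(ordered)


# Synthetic per-account usage (assumed parameters, for illustration only).
rng = random.Random(42)
n = 100_000
heavy = [rng.paretovariate(1.5) for _ in range(n)]    # power-law tail
light = [abs(rng.gauss(10.0, 2.0)) for _ in range(n)]  # near-normal usage

alpha_hat = hill_tail_index(heavy, k=1000)
heavy_share = top_share(heavy)
light_share = top_share(light)

print(f"Hill tail exponent (Pareto sample): {alpha_hat:.2f}")
print(f"Top 1% share, Pareto sample: {heavy_share:.1%}")
print(f"Top 1% share, normal sample: {light_share:.1%}")
print("Log-returns of [1, 2, 4]:", log_returns([1.0, 2.0, 4.0]))
```

In the heavy-tailed sample a small fraction of accounts contributes a large part of the total, while in the near-normal sample the top 1% contributes only slightly more than 1%. A tail exponent below 2 corresponds to infinite variance, which is the case where averaging over many accounts does not smooth usage the way a normal model would predict.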
Additional information
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Cite this article
Loboz, C. Cloud Resource Usage—Heavy Tailed Distributions Invalidating Traditional Capacity Planning Models. J Grid Computing 10, 85–108 (2012). https://doi.org/10.1007/s10723-012-9211-x
DOI: https://doi.org/10.1007/s10723-012-9211-x