
Characterizing and modeling cloud applications/jobs on a Google data center

Published in The Journal of Supercomputing

Abstract

In this paper, we characterize and model Google applications and jobs, based on a one-month trace from a large-scale Google data center. We make four contributions: (1) we compute key statistics about task events and resource utilization for Google applications, broken down by resource type and execution type; (2) we classify applications via a K-means clustering algorithm with an optimized number of clusters, based on task events and resource usage; (3) we study the correlation between Google application properties and runtime features (e.g., job priority and scheduling class); (4) we build a model that can simulate Google jobs/tasks and dynamic events in accordance with the trace. Experiments show that tasks simulated with our model exhibit features closely analogous to those in the Google trace: over 95 % of tasks have simulation errors below 20 %, confirming the high accuracy of our simulation model.
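
As an illustration of the clustering step mentioned above (a minimal sketch, not the paper's exact pipeline), the following code runs plain K-means (Lloyd's algorithm) over per-task resource-usage vectors. The task data and the choice k = 2 are invented for the example; the paper additionally optimizes the number of clusters, which is omitted here.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means (Lloyd's algorithm) over small feature vectors,
    e.g. (CPU usage, memory usage) per task."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Update step: move each non-empty cluster's center to its mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers, clusters

# Synthetic stand-in data: two obvious "application" groups,
# low-usage tasks vs. high-usage tasks (normalized units).
tasks = [(0.01, 0.02), (0.02, 0.01), (0.015, 0.02),
         (0.40, 0.50), (0.45, 0.48), (0.50, 0.45)]
centers, clusters = kmeans(tasks, k=2)
sizes = sorted(len(c) for c in clusters)
print(sizes)  # [3, 3]
```

With well-separated groups like these, Lloyd's iterations converge to the natural 3/3 split regardless of which two points are drawn as initial centers.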


Notes

  1. Scheduling class (0–3), according to [3], roughly represents how latency-sensitive a job/task is: 3 denotes the most latency-sensitive tasks and 0 denotes non-production tasks.

  2. The Google trace does not expose the exact memory size used by jobs, but rather values scaled relative to the maximum memory capacity of a node. For example, if the maximum memory capacity on a host is 64 GB, a memory size of 0.05 means \(0.05 \times 64 = 3.2\) GB.
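
A one-line helper makes the note's conversion concrete; the 64 GB capacity is just the note's example figure, not a value taken from the trace.

```python
def scaled_to_gb(scaled_mem, host_capacity_gb):
    """Convert the trace's normalized memory value back to gigabytes.
    host_capacity_gb is the assumed capacity of the reference node."""
    return scaled_mem * host_capacity_gb

# The note's example: a scaled value of 0.05 on a 64 GB host.
print(scaled_to_gb(0.05, 64))  # 3.2
```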

  3. According to the Google trace documentation [4], task interruptions have several causes: (1) fail event: a task or job was descheduled (or, in rare cases, ceased to be eligible for scheduling while it was pending) due to a task failure; (2) evict event: a task or job was descheduled because of a higher-priority task or job, because the scheduler overcommitted and actual demand exceeded the machine capacity, because the machine on which it was running became unusable, or because a disk holding the task's data was lost; (3) kill event: a task or job was canceled, or another job or task on which it depended died; (4) lost event: a task or job was presumably terminated, but its termination record is missing.
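
The four interruption causes above can be collected into a small lookup table. This is purely illustrative: the string keys are descriptive names chosen for this sketch, whereas the actual trace encodes event types as numeric codes.

```python
# Illustrative labels for the four interruption causes described in [4].
INTERRUPTION_EVENTS = {
    "fail":  "descheduled due to a task failure",
    "evict": "preempted by higher-priority work, scheduler overcommit, "
             "machine loss, or a lost disk",
    "kill":  "cancelled, or a job/task it depended on died",
    "lost":  "presumably terminated, but the record is missing",
}

def is_interruption(event_name):
    """True if the (descriptive) event name denotes an interruption."""
    return event_name in INTERRUPTION_EVENTS

print(is_interruption("evict"), is_interruption("finish"))  # True False
```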

References

  1. Armbrust M, Fox A, Griffith R, Joseph A et al (2009) Above the clouds: a Berkeley view of cloud computing. EECS Department, University of California, Berkeley, Technical Report UCB/EECS-2009-28

  2. Vaquero L, Rodero-Merino L, Caceres J, Lindner M (2009) A break in the clouds: towards a cloud definition. SIGCOMM Comput Commun Rev 39(1):50–55

  3. Wilkes J (2011) More Google cluster data. Google research blog. http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html

  4. Reiss C, Wilkes J, Hellerstein J (2012) Google cluster-usage traces: format + schema. Google Inc., Mountain View, USA, Technical Report

  5. Di S, Kondo D, Cirne W (2012) Characterization and comparison of cloud versus grid workloads. In: IEEE international conference on cluster computing (Cluster'12), pp 230–238

  6. Meng X, Isci C, Kephart J, Zhang L, Bouillet E, Pendarakis D (2010) Efficient resource provisioning in compute clouds via VM multiplexing. In: Proceedings of the 7th international conference on autonomic computing (ICAC’10), New York, ACM, pp 11–20

  7. Buyya R, Ranjan R, Calheiros R (2010) Intercloud: utility-oriented federation of cloud computing environments for scaling of application services. In: 10th international conference on algorithms and architectures for parallel processing (ICA3PP’10), pp 13–31

  8. Stillwell M, Vivien F, Casanova H (2012) Virtual machine resource allocation for service hosting on heterogeneous distributed platforms. In: Proceedings of IEEE 26th international conference on parallel distributed processing symposium (IPDPS’12), pp 786–797

  9. Calheiros R, Ranjan R, Beloglazov A, De-Rose C, Buyya R (2011) Cloudsim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw Pract Exp 41(1):23–50

  10. Di S, Wang C-L (2013) Dynamic optimization of multi-attribute resource allocation in self-organizing clouds. IEEE Trans Parallel Distrib Syst (TPDS) 24(3):464–478

  11. Dean J, Ghemawat S (2004) MapReduce: Simplified data processing on large clusters. In: 5th USENIX symposium on operating systems design and implementation (OSDI’04), pp 137–150

  12. Reiss C, Tumanov A, Ganger G, Katz R, Kozuch M (2012) Towards understanding heterogeneous clouds at scale: Google trace analysis. Intel science and technology center for cloud computing. Carnegie Mellon University, Pittsburgh, Technical Report ISTC-CC-TR-12-101

  13. Feitelson D (2011) Workload modeling for computer systems performance evaluation. http://www.cs.huji.ac.il/~feit/wlmod/

  14. Koch R (1997) The 80/20 principle: the secret of achieving more with less. Nicholas Brealey

  15. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297

  16. Okabe A, Boots B, Sugihara K, Chiu S (2000) Spatial tessellations: concepts and applications of Voronoi diagrams, 2nd edn. Series in probability and statistics. Wiley, England

  17. Ross S (2010) Introduction to probability models, 10th edn. Academic Press, Burlington

  18. Sharma B, Chudnovsky V, Hellerstein J, Rifaat R, Das C (2011) Modeling and synthesizing task placement constraints in google compute clusters. In: Proceedings of the 2nd ACM symposium on cloud computing (SOCC’11), New York, ACM, pp 3:1–3:14

  19. Mishra A, Hellerstein J, Cirne W, Das C-R (2010) Towards characterizing cloud backend workloads: insights from Google compute clusters. SIGMETRICS Perform Eval Rev 37(4):34–41

  20. Zhang Q, Hellerstein JL, Boutaba R (2011) Characterizing task usage shapes in Google compute clusters. In: Large scale distributed systems and middleware workshop (LADIS’11)

  21. Liu Z, Cho S (2012) Characterizing machines and workloads on a Google cluster. In: 8th international workshop on scheduling and resource management for parallel and distributed systems (SRMPDS’12), pp 397–403

  22. Ganapathi A, Chen Y, Fox A, Katz RH, Patterson DA (2010) Statistics-driven workload modeling for the cloud. ICDE workshops’10, pp 87–92

  23. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: IEEE 26th symposium on mass storage systems and technologies (MSST’10), pp 1–10

  24. Li A, Zong X, Kandula S, Yang X, Zhang M (2011) CloudProphet: towards application performance prediction in cloud. ACM SIGCOMM student poster, pp 426–427

  25. Jackson KR, Ramakrishnan L, Muriki K et al (2010) Performance analysis of high performance computing applications on the Amazon Web Services cloud. In: Proceedings of the IEEE 2nd international conference on cloud computing technology and science (CloudCom’10), Washington, DC, IEEE Computer Society, pp 159–168

  26. Hamerly G, Elkan C (2002) Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the 17th international conference on Information and knowledge management (CIKM’02), New York, ACM, pp 600–607

Acknowledgments

We thank Google Inc, in particular Charles Reiss and John Wilkes, for making their invaluable trace data available. This work is supported by ANR project Clouds@home (ANR-09-JCJC-0056-01), also in part by the Advanced Scientific Computing Research Program, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and by the INRIA-Illinois Joint Laboratory for Petascale Computing. This paper has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

Author information

Correspondence to Sheng Di.

About this article

Cite this article

Di, S., Kondo, D. & Cappello, F. Characterizing and modeling cloud applications/jobs on a Google data center. J Supercomput 69, 139–160 (2014). https://doi.org/10.1007/s11227-014-1131-z
