Consolidated cluster systems for data centers in the cloud age: a survey and analysis

  • Review Article
Frontiers of Computer Science

Abstract

In the cloud age, heterogeneous application modes running on large-scale infrastructures pose challenges for data centers in resource utilization and manageability. Many resource and runtime management systems have been developed, or have evolved, to address these challenges and related problems from different perspectives. This paper aims to identify the main motivations, key concerns, common features, and representative solutions of such systems through a survey and analysis. A typical class of these systems is generalized as the consolidated cluster system, whose design goal is identified as reducing overall costs while maintaining quality of service. We survey systems of this class and summarize their critical issues as resource consolidation and runtime coordination. These two issues are analyzed and classified according to the design styles and external characteristics abstracted from the surveyed work. Five representative consolidated cluster systems from academia and industry are illustrated and compared in detail on the basis of this analysis and classification. We hope that this survey and analysis will benefit both the design and implementation of such systems and the selection among them, in response to the constantly emerging challenges of infrastructure and application management in data centers.

Author information

Corresponding author

Correspondence to Jian Lin.

Additional information

Jian Lin is a PhD candidate in computer architecture at the Institute of Computing Technology, Chinese Academy of Sciences. His current research interests include distributed software architecture, large-scale resource management, and security technologies in grid and cloud computing systems.

Li Zha obtained his PhD in 2003 and is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences. He has been the project leader of several national-level research programs. His research focuses on large-scale distributed resource management, data storage, processing, and retrieval, and system-level optimization. His interests also include other classic issues in the fields of distributed computing and grid computing.

Zhiwei Xu received his PhD from the University of Southern California in 1987. He is currently a professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include network computing, distributed operating systems, and high-performance computer architecture. He has served on the editorial boards of the IEEE Transactions on Services Computing, Journal of Grid Computing, Journal of Computer Science and Technology, and Journal of Computer Research and Development. He is a senior member of the IEEE.

About this article

Cite this article

Lin, J., Zha, L. & Xu, Z. Consolidated cluster systems for data centers in the cloud age: a survey and analysis. Front. Comput. Sci. 7, 1–19 (2013). https://doi.org/10.1007/s11704-012-2086-y

  • DOI: https://doi.org/10.1007/s11704-012-2086-y
