Cluster resource scheduling in cloud computing: literature review and research challenges

The Journal of Supercomputing

Abstract

Scheduling plays a pivotal role in cloud computing systems, and designing an efficient scheduler is a challenging task. The challenge stems from several factors, including the multi-dimensionality of resource demands, the heterogeneity of jobs, the diversity of computing resources, and the need for fairness among multiple tenants sharing the cluster. This survey provides a multi-perspective overview of the cluster scheduling problem. We present a multi-dimensional classification of existing cluster management solutions based on their scheduling architectures, objectives, and methods. We also survey recent research that employs machine learning in cloud computing resource management. Existing cluster scheduling systems face many challenges, such as achieving a tradeoff between multiple conflicting objectives, balancing the requirements of different jobs, scaling to new operational demands, and choosing the appropriate scheduling architecture. Applying machine learning to cluster scheduling is a promising direction for developing the next generation of intelligent resource schedulers.



Acknowledgements

This work is partially supported by the National Science Foundation under grant CNS-1828593.

Author information

Corresponding author

Correspondence to Wael Khallouli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khallouli, W., Huang, J. Cluster resource scheduling in cloud computing: literature review and research challenges. J Supercomput 78, 6898–6943 (2022). https://doi.org/10.1007/s11227-021-04138-z

  • DOI: https://doi.org/10.1007/s11227-021-04138-z
