Cluster resource scheduling in cloud computing: literature review and research challenges

Khallouli, Wael; Huang, Jingwei

doi:10.1007/s11227-021-04138-z

Published: 29 October 2021

Volume 78, pages 6898–6943 (2022)

The Journal of Supercomputing

Abstract

Scheduling plays a pivotal role in cloud computing systems, and designing an efficient scheduler is a challenging task. The challenge stems from several factors, including the multi-dimensionality of resource demands, the heterogeneity of jobs, the diversity of computing resources, and fairness among the multiple tenants sharing a cluster. This survey provides a multi-perspective overview of the cluster scheduling problem. We present a multi-dimensional classification of existing cluster management solutions based on their scheduling architectures, objectives, and methods. We also survey recent research that employs machine learning in cloud computing resource management. Existing cluster scheduling systems face many challenges, such as achieving a tradeoff between multiple conflicting objectives, balancing jobs’ requirements, scaling to new operational demands, and choosing an appropriate scheduling architecture. Using machine learning in cluster scheduling is a promising direction for developing the next generation of intelligent resource schedulers.

Acknowledgements

This work is partially supported by the National Science Foundation under grant CNS-1828593.

Author information

Authors and Affiliations

Department of Engineering Management and Systems Engineering, Old Dominion University, 2101 Engineering Systems Building, Norfolk, VA, 23529, USA
Wael Khallouli & Jingwei Huang

Authors

Wael Khallouli
Jingwei Huang

Corresponding author

Correspondence to Wael Khallouli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khallouli, W., Huang, J. Cluster resource scheduling in cloud computing: literature review and research challenges. J Supercomput 78, 6898–6943 (2022). https://doi.org/10.1007/s11227-021-04138-z

Accepted: 11 October 2021
Published: 29 October 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s11227-021-04138-z
