A Survey of System Scheduling for HPC and Big Data

Authors:

Nong Xiao

HP3C 2020: Proceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications

Pages 178 - 183

https://doi.org/10.1145/3407947.3407977

Published: 06 August 2020

Abstract

In the rapidly expanding field of parallel processing, job schedulers act as the "operating systems" of clusters, including modern big data architectures and supercomputing systems. Job schedulers manage and allocate system resources, dispatch queued jobs, and control the execution of processes on the allocated resources. In this paper, we first introduce cluster schedulers. Then, organized by usage scenario, we present a comprehensive survey of schedulers for HPC and Big Data. We conclude that most current schedulers are centralized, meaning that a master assigns jobs to the slaves. We call this mode Push, in contrast to our new idea of introducing Pull into schedulers. We propose a novel scheduling model that allows slaves to actively pull jobs from the master for execution. By analyzing the execution times and resource requests of jobs on "Tianhe-II", we argue that scheduling based on Push & Pull is a direction worthy of in-depth study.
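
The difference between the two modes can be illustrated with a minimal, single-process sketch. It is not the model proposed in the paper; the class names `Master` and `Worker` (standing in for the abstract's slave nodes), the `capacity` field, the first-fit push policy, and the round-robin pull loop are all assumptions made for illustration only.

```python
from collections import deque


class Worker:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # hypothetical: max jobs held at once
        self.jobs = []

    def has_room(self):
        return len(self.jobs) < self.capacity

    def pull(self, master):
        """Pull: the worker itself asks the master for work when it has room."""
        if not self.has_room():
            return False
        job = master.request_job()
        if job is None:
            return False
        self.jobs.append(job)
        return True


class Master:
    def __init__(self, jobs):
        self.queue = deque(jobs)

    def push(self, workers):
        """Push: the master picks a worker for each queued job (first fit)."""
        while self.queue:
            target = next((w for w in workers if w.has_room()), None)
            if target is None:
                break  # every worker is full; the rest stays queued
            target.jobs.append(self.queue.popleft())

    def request_job(self):
        """Hand out the next queued job, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


def run_pull(master, workers):
    """Let workers take turns pulling until nobody can make progress."""
    # A list (not a generator) is used so every worker gets a turn each round.
    while any([w.pull(master) for w in workers]):
        pass
```

In the Push path the placement decision sits entirely with the master, which must know each worker's state. In the Pull path the master only exposes a queue, and each worker decides for itself when it is ready to take more work.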

Cited By

Eda T, Busto M, Udagawa T, Ishihama N, Tabata K, Matsuo Y, Yamasaki I. (2024). Technical Challenges for AI in Space Data Centers. IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, pp. 1733-1735. Online publication date: 7-Jul-2024.
https://doi.org/10.1109/IGARSS53475.2024.10642138
Shao M, Lu K, Zhang W. (2022). Self-deployed execution environment for high performance computing. Frontiers of Information Technology & Electronic Engineering, 23:6, pp. 845-857. Online publication date: 4-Mar-2022.
https://doi.org/10.1631/FITEE.2100016
Gawanmeh A, Mansoor W, Abed S, Kablaoui D, Al Faisal H. (2021). Starvation Avoidance Task Scheduling Algorithm for Heterogeneous Computing Systems. 2021 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1794-1799. Online publication date: Dec-2021.
https://doi.org/10.1109/CSCI54926.2021.00339

Index Terms

A Survey of System Scheduling for HPC and Big Data
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Client-server architectures

Published In

HP3C 2020: Proceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications

June 2020

191 pages

ISBN:9781450376914

DOI:10.1145/3407947

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Xi'an Jiaotong-Liverpool University
City University of Hong Kong
Guangdong University of Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HP3C 2020

HP3C 2020: 2020 4th International Conference on High Performance Compilation, Computing and Communications

June 27 - 29, 2020

Guangzhou, China
