Research article. DOI: 10.1145/3407947.3407977

A Survey of System Scheduling for HPC and Big Data

Published: 06 August 2020

Abstract

In the rapidly expanding field of parallel processing, job schedulers act as the "operating systems" of clusters, including modern big data architectures and supercomputing systems. Job schedulers manage and allocate system resources, dispatch queued jobs, and control the execution of processes on the allocated resources. In this paper, we first introduce cluster schedulers. Then, organized by usage scenario, we give a comprehensive survey of schedulers for HPC and Big Data. We conclude that most current schedulers are centralized, meaning that the master assigns jobs to the slaves. We call this mode Push, in contrast to our new idea of introducing Pull into schedulers. We propose a novel scheduling model that allows slaves to actively pull jobs from the master for execution. By analyzing the execution times and resource requests of jobs on "Tianhe-II", we argue that scheduling based on Push & Pull is a direction worthy of in-depth future study.
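The difference between the two modes can be illustrated with a minimal Python sketch. This is not the model proposed in the paper: the class names (`Job`, `Worker`, `Master`), the single `cores` resource, and the first-fit placement rule are all assumptions chosen only to contrast a master that assigns jobs (Push) with a worker that requests them (Pull).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cores: int  # hypothetical resource request; real jobs request more than cores


@dataclass
class Worker:
    name: str
    free_cores: int
    running: list = field(default_factory=list)

    def accept(self, job):
        # Reserve the requested cores and record the job as running.
        self.free_cores -= job.cores
        self.running.append(job.name)


class Master:
    def __init__(self, jobs):
        self.queue = deque(jobs)

    def push(self, workers):
        """Push: the master walks its queue and chooses a worker for each job."""
        waiting = deque()
        while self.queue:
            job = self.queue.popleft()
            # First-fit placement (an assumption, not the paper's policy).
            target = next((w for w in workers if w.free_cores >= job.cores), None)
            if target is None:
                waiting.append(job)  # no worker fits; the job stays queued
            else:
                target.accept(job)
        self.queue = waiting

    def pull(self, worker):
        """Pull: a worker asks the master for the first queued job it can fit."""
        for job in self.queue:
            if job.cores <= worker.free_cores:
                self.queue.remove(job)  # safe: we return right after mutating
                worker.accept(job)
                return job
        return None  # nothing in the queue fits this worker
```

In the Push path the master needs a current view of every worker's free resources before it decides; in the Pull path each worker reports its own capacity at the moment it asks, so the master only has to answer requests.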



    Published In

    HP3C 2020: Proceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications
    June 2020
    191 pages
    ISBN:9781450376914
    DOI:10.1145/3407947

    In-Cooperation

    • Xi'an Jiaotong-Liverpool University
    • City University of Hong Kong
    • Guangdong University of Technology

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Job scheduler
    2. big data cluster
    3. decentralized scheduling
    4. high performance computing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    HP3C 2020
