ABSTRACT
Job schedulers play an important role in selecting optimal resources for submitted jobs. However, most current job schedulers do not consider job-specific characteristics, such as communication patterns, during resource allocation. This often leads to sub-optimal node allocations. We propose three node allocation algorithms that consider a job’s communication behavior to improve the performance of communication-intensive jobs. We develop our algorithms for tree-based network topologies. The proposed algorithms aim to minimize network contention by allocating nodes on the least contended switches. We also show that allocating nodes in powers of two decreases inter-switch communication for MPI communications, which further improves performance. We implement and evaluate our algorithms using SLURM, a widely used job scheduler. We show that the proposed algorithms can reduce the execution times of communication-intensive jobs by 9% (326 hours) on average. The average wait time of jobs is reduced by 31% across three supercomputer job logs.
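The abstract does not spell out the three algorithms, so the following is only a minimal sketch of the general idea it describes: prefer the least contended switches and hand out nodes in power-of-two chunks. The `Switch` record, its `contention` score, and the `allocate` function are hypothetical names introduced for illustration; they are not taken from the paper or from SLURM.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str          # hypothetical leaf-switch identifier
    free_nodes: int    # idle nodes attached to this switch
    contention: float  # assumed load metric; lower means less contended


def largest_power_of_two(n: int) -> int:
    """Return the largest power of two that is <= n (0 if n < 1)."""
    return 1 << (n.bit_length() - 1) if n >= 1 else 0


def allocate(switches, requested):
    """Greedy sketch, not the paper's algorithm.

    Pass 1 visits switches from least to most contended and takes a
    power-of-two chunk from each. Pass 2 tops up any shortfall from the
    spare nodes on those switches, in the same order. Returns a dict of
    {switch name: node count}, or None if the request cannot be met.
    """
    ordered = sorted(switches, key=lambda s: (s.contention, s.name))
    allocation = {}
    remaining = requested

    # Pass 1: power-of-two chunks on the least contended switches first.
    for sw in ordered:
        if remaining == 0:
            break
        chunk = largest_power_of_two(min(sw.free_nodes, remaining))
        if chunk:
            allocation[sw.name] = chunk
            remaining -= chunk

    # Pass 2: fill any shortfall with leftover nodes, ignoring chunk size.
    for sw in ordered:
        if remaining == 0:
            break
        spare = sw.free_nodes - allocation.get(sw.name, 0)
        take = min(spare, remaining)
        if take:
            allocation[sw.name] = allocation.get(sw.name, 0) + take
            remaining -= take

    return allocation if remaining == 0 else None


example = [Switch("s0", 6, 0.8), Switch("s1", 4, 0.1), Switch("s2", 3, 0.4)]
print(allocate(example, 8))
```

In this toy example the request for 8 nodes is served first from `s1`, the least contended switch, and every switch contributes a power-of-two count. A real scheduler would also have to account for the tree hierarchy above the leaf switches, which this sketch ignores.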