Abstract
Modern data centers serve workloads that can exploit parallelism. When a job parallelizes across multiple servers, it completes more quickly. However, it is unclear how to share a limited number of servers among many parallelizable jobs.
In this paper we consider a typical scenario where a data center composed of N servers is tasked with completing a set of M parallelizable jobs. Typically, M is much smaller than N. In our scenario, each job consists of some amount of inherent work, which we refer to as the job's size. We assume that job sizes are known to the system up front, and that each job can utilize any number of servers at any moment in time. These assumptions are reasonable for many parallelizable workloads, such as training neural networks using TensorFlow [2]. Our goal in this paper is to allocate servers to jobs so as to minimize the mean slowdown across all jobs, where the slowdown of a job is the job's completion time divided by its running time if given exclusive access to all N servers. Slowdown measures how much a job was delayed by the other jobs in the system, and is often the metric of interest in the theoretical parallel scheduling literature (where it is also called stretch), as well as in the HPC community (where it is called expansion factor).
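The slowdown definition above can be sketched in a few lines. This is a minimal illustration, not code from the paper: `speedup` stands in for whatever speedup function governs how a job's running time shrinks with more servers (the linear function used in the example is a hypothetical, idealized choice).

```python
def mean_slowdown(sizes, completion_times, n_servers, speedup):
    """Mean slowdown across jobs.

    A job's slowdown is its completion time under the scheduler divided
    by its running time if it had exclusive access to all n_servers.
    """
    slowdowns = []
    for size, completed_at in zip(sizes, completion_times):
        # Isolated running time: inherent work divided by the speedup
        # the job gets from all n_servers.
        isolated = size / speedup(n_servers)
        slowdowns.append(completed_at / isolated)
    return sum(slowdowns) / len(slowdowns)

# Example with a hypothetical perfectly linear speedup, s(k) = k:
# on 4 servers, a size-4 job runs alone in 1.0 and a size-8 job in 2.0,
# so completion times of 2.0 and 3.0 give slowdowns 2.0 and 1.5.
print(mean_slowdown([4.0, 8.0], [2.0, 3.0], 4, lambda k: k))  # 1.75
```

A job that is never delayed by others has slowdown 1, so mean slowdown is at least 1 under any allocation.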
- B. Berg, J. P. Dorsman, and M. Harchol-Balter. Towards optimality in parallel scheduling. ACM POMACS, 1(2), 2018.
- S. Lin, M. Paolieri, C. Chou, and L. Golubchik. A model-based approach to streamlining distributed training for asynchronous SGD. In MASCOTS. IEEE, 2018.
- D. R. Smith. A new proof of the optimality of the shortest remaining processing time discipline. Operations Research, 26(1):197--199, 1978.
- A. Wierman, M. Harchol-Balter, and T. Osogami. Nearly insensitive bounds on SMART scheduling. SIGMETRICS, 33(1):205--216, 2005.
Index Terms
- heSRPT: Parallel Scheduling to Minimize Mean Slowdown