research-article

heSRPT: Parallel Scheduling to Minimize Mean Slowdown

Published: 05 March 2021
Abstract

Modern data centers serve workloads that can exploit parallelism. When a job parallelizes across multiple servers, it completes more quickly. However, it is unclear how to share a limited number of servers among many parallelizable jobs.

In this paper we consider a typical scenario where a data center composed of N servers will be tasked with completing a set of M parallelizable jobs. Typically, M is much smaller than N. In our scenario, each job consists of some amount of inherent work, which we refer to as the job's size. We assume that job sizes are known up front to the system, and that each job can utilize any number of servers at any moment in time. These assumptions are reasonable for many parallelizable workloads, such as training neural networks using TensorFlow [2]. Our goal in this paper is to allocate servers to jobs so as to minimize the mean slowdown across all jobs, where the slowdown of a job is the job's completion time divided by its running time if given exclusive access to all N servers. Slowdown measures how much a job was delayed by interference from other jobs in the system, and is often the metric of interest in the theoretical parallel scheduling literature (where it is also called stretch), as well as in the HPC community (where it is called expansion factor).
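
As an illustration of the metric defined above, the following minimal sketch computes mean slowdown from per-job measurements. The function name `mean_slowdown`, its parameters, and the numbers in the example are hypothetical and do not come from the paper. The sketch assumes that each job's completion time and its running time with exclusive access to all N servers are already known. It does not model how running time depends on the number of servers.

```python
from typing import Sequence


def mean_slowdown(completion_times: Sequence[float],
                  isolated_times: Sequence[float]) -> float:
    """Average of per-job slowdowns.

    completion_times[i]: time at which job i finishes under the shared schedule.
    isolated_times[i]:   running time of job i if it had all N servers to itself
                         (assumed to be given; no speedup model is implied here).
    """
    if not completion_times or len(completion_times) != len(isolated_times):
        raise ValueError("need two non-empty sequences of equal length")
    # Slowdown of a job = completion time / running time in isolation.
    slowdowns = [c / t for c, t in zip(completion_times, isolated_times)]
    return sum(slowdowns) / len(slowdowns)


# Hypothetical numbers for three jobs, chosen only for illustration.
completion = [2.0, 6.0, 12.0]
isolated = [1.0, 2.0, 3.0]
print(mean_slowdown(completion, isolated))
```

A slowdown of 1 means that a job finished as quickly as it would have with the whole data center to itself. Larger values indicate more interference from other jobs.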

References

  1. B. Berg, J.P. Dorsman, and M. Harchol-Balter. Towards optimality in parallel scheduling. ACM POMACS, 1(2), 2018.
  2. S. Lin, M. Paolieri, C. Chou, and L. Golubchik. A model-based approach to streamlining distributed training for asynchronous SGD. In MASCOTS. IEEE, 2018.
  3. D.R. Smith. A new proof of the optimality of the shortest remaining processing time discipline. Operations Research, 26(1):197-199, 1978.
  4. A. Wierman, M. Harchol-Balter, and T. Osogami. Nearly insensitive bounds on SMART scheduling. SIGMETRICS, 33(1):205-216, 2005.

Published in

ACM SIGMETRICS Performance Evaluation Review, Volume 48, Issue 3
December 2020
140 pages
ISSN: 0163-5999
DOI: 10.1145/3453953

Copyright © 2021 Copyright is held by the owner/author(s)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States
