Abstract:
Torus-connected network is widely used in modern supercomputers due to its linear per node cost scaling and its competitive overall performance. Job scheduling system pla...Show MoreMetadata
Abstract:
Torus-connected network is widely used in modern supercomputers due to its linear per node cost scaling and its competitive overall performance. Job scheduling system plays a critical role for the efficient use of supercomputers. As supercomputers continue growing in size, a fundamental problem arises: how to effectively balance job performance with system performance on torus-connected machines? In this work, we will present a new scheduling design named window-based locality-aware scheduling. Our design contains three novel features. First, rather than one-by-one job scheduling, our design takes a “window” of jobs, i.e. multiple jobs, into consideration for job prioritizing and resource allocation. Second, our design maintains a list of slots to preserve node contiguity information for resource allocation. Finally, we formulate our scheduling decision making into a 0-1 Multiple Knapsack Problem and present two algorithms to solve the problem. A series of trace-based simulations using job logs collected from production supercomputers indicate that this new scheduling design has real potentials and can effectively balance job performance and system performance.
Date of Conference: 22-26 September 2014
Date Added to IEEE Xplore: 01 December 2014
Electronic ISBN:978-1-4799-5548-0