
KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers



Abstract:

Machine learning (ML) inference workloads present significantly different challenges than ML training workloads. Inference workloads are typically short-running and under-utilize GPU resources. To overcome this, co-locating multiple instances of a model on a single GPU has been proposed to improve utilization. Co-located models share the GPU through spatial partitioning facilities such as Nvidia's MPS and MIG or AMD's CU Masking API. Existing spatially partitioned inference servers create model-wise partitions by "right-sizing" based on a model's latency tolerance to restricted resources. We show that model-wise right-sizing still under-utilizes the GPU, because individual kernels within an inference pass tolerate resource restriction to varying degrees. We propose Kernel-wise Right-sizing for Spatial Partitioned GPU Inference Servers (KRISP) to enable right-sizing of spatial partitions at the granularity of individual kernels. We demonstrate that KRISP supports a greater number of concurrently running inference models than existing spatially partitioned inference servers. KRISP improves overall throughput by 2x compared with isolated inference (1.22x vs. prior works) and reduces energy per inference by 33%.
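The core idea of right-sizing can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the latency profile, tolerance value, and function names are all assumed. Each kernel is assigned the smallest partition (fraction of compute units) whose latency stays within a tolerance of its full-GPU latency; a model-wise right-sizer must instead pick one partition size large enough for every kernel, which is why it leaves resources idle during tolerant kernels.

```python
# Hypothetical sketch of kernel-wise vs. model-wise right-sizing.
# The latency profile below is invented for illustration only.

LATENCY_TOLERANCE = 1.10  # allow up to 10% slowdown per kernel (assumed)

# Assumed profile: latency (ms) of each kernel at each partition fraction.
PROFILE = {
    "conv1":   {0.25: 1.00, 0.50: 0.55, 1.00: 0.50},  # needs a large partition
    "bn_relu": {0.25: 0.21, 0.50: 0.20, 1.00: 0.20},  # tolerates a small one
    "fc":      {0.25: 0.32, 0.50: 0.31, 1.00: 0.30},  # tolerates a small one
}

def right_size_kernel(latencies, tolerance=LATENCY_TOLERANCE):
    """Smallest partition whose latency is within tolerance of full-GPU latency."""
    full = latencies[1.00]
    for frac in sorted(latencies):
        if latencies[frac] <= tolerance * full:
            return frac
    return 1.00

def kernel_wise(profile):
    # KRISP-style: an independent partition size per kernel.
    return {k: right_size_kernel(lat) for k, lat in profile.items()}

def model_wise(profile):
    # Prior-work style: one size for the whole model, so it must be
    # the largest of the per-kernel requirements.
    return max(right_size_kernel(lat) for lat in profile.values())

print(kernel_wise(PROFILE))  # e.g. {'conv1': 0.5, 'bn_relu': 0.25, 'fc': 0.25}
print(model_wise(PROFILE))   # 0.5: the whole model is held at conv1's size
```

Under this toy profile, kernel-wise sizing frees three quarters of the GPU during `bn_relu` and `fc`, capacity that a co-located model could use; model-wise sizing reserves half the GPU for the entire pass.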
Date of Conference: 25 February 2023 - 01 March 2023
Date Added to IEEE Xplore: 24 March 2023

Conference Location: Montreal, QC, Canada

