Abstract:
Communication is a critical bottleneck for GPUs, manifesting as energy and performance overheads due to network-on-chip (NoC) delay and congestion. While many algorithms exhibit locality between thread blocks and the data they access, modern GPUs lack the interface to exploit this locality: GPU thread blocks are mapped to cores obliviously. In this work, we explore a simple extension to the conventional GPU programming interface that enables control over the spatial placement of data and threads, yielding new opportunities for aggressive locality optimizations within a GPU kernel. Across 7 workloads that can take advantage of these optimizations, on a 32-SM (or 128-SM) GPU we achieve a 1.28× (1.54×) speedup and a 35% (44%) reduction in NoC traffic compared to baseline non-spatial GPUs.
Published in: IEEE Computer Architecture Letters (Volume: 23, Issue: 2, July-Dec. 2024)
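
To make the interface idea concrete, below is a minimal CUDA sketch. The runnable parts use only standard CUDA, including the real %smid PTX special register, which lets a block observe which SM it was scheduled onto and thus demonstrates that today's block-to-SM mapping is opaque to the programmer. The placement-hint launch at the end is hypothetical: cudaLaunchPlacement, makeRowMajorPlacement, and stencilPlaced are invented here purely to illustrate the kind of interface extension the abstract argues for, and are not the paper's actual API.

    // Sketch: observing oblivious block placement, plus a hypothetical
    // spatially-aware launch. Compile with: nvcc -o sketch sketch.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    // Reads the hardware SM id this block is running on (real PTX register).
    __device__ unsigned smId() {
        unsigned id;
        asm("mov.u32 %0, %%smid;" : "=r"(id));
        return id;
    }

    __global__ void stencil(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
        // Neighbouring blocks share halo data, but on a baseline GPU they
        // may land on distant SMs, so the halo traffic crosses the NoC.
        if (threadIdx.x == 0)
            printf("block %d ran on SM %u\n", blockIdx.x, smId());
    }

    int main() {
        const int n = 1 << 20, threads = 256, blocks = n / threads;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));

        // Baseline (real CUDA): block-to-SM mapping is not programmer-visible.
        stencil<<<blocks, threads>>>(in, out, n);

        // Hypothetical extension (illustration only, NOT a real CUDA API):
        // attach a placement map so consecutive blocks, which touch adjacent
        // data, co-locate on the same SM or on NoC-adjacent SMs.
        //   cudaLaunchPlacement p = makeRowMajorPlacement(blocks, /*numSMs=*/32);
        //   stencilPlaced<<<blocks, threads, 0, 0, p>>>(in, out, n);

        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Running the baseline launch a few times shows the same block id reporting different SM ids across runs, which is exactly the obliviousness the abstract identifies; the commented-out launch sketches how a placement map could pin neighbouring blocks near each other and near their data to cut NoC traffic.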