Abstract:
Modern GPUs feature complex memory system designs. One GPU may contain many types of memory of different properties. The best way to place data in memory is sensitive to ...Show MoreMetadata
Abstract:
Modern GPUs feature complex memory system designs. One GPU may contain many types of memory of different properties. The best way to place data in memory is sensitive to many factors (e.g., program inputs, architectures), making portable optimizations of GPU data placement a difficult challenge. PORPLE is a recently proposed method that overcomes the difficulties by enabling online optimizations of data placement through a three-way synergy: a specification language for memory system description, a compiler framework for data access analysis and code staging, and a runtime library for efficiently finding and materializing data placement on the fly. This article provides a comprehensive description of this method, and presents several extensions that significantly improve the scalability of PORPLE, which include a novel algorithm design for efficiently searching for the best data placements, the use of active profiling for reducing the online-profiling overhead, and a systematic examination of a path-based performance model. By automatically tailoring data placements for each execution of a GPU program, the enhanced PORPLE brings significant speedups (1.72X on average) to many GPU kernels across GPU architectures and program inputs.
Published in: IEEE Transactions on Computers ( Volume: 66, Issue: 3, 01 March 2017)