ABSTRACT
Data caches were introduced to GPUs to mitigate the problem of irregular memory accesses, but few studies have investigated how to exploit their full potential. In this work, we consider several important GPU applications that feature data sharing across thread blocks. We show that this sharing is poorly exploited because the current GPU runtime ignores it when scheduling thread blocks. We then present an application-level transformation that remaps thread blocks to data on the fly. With this software-level scheduler, thread blocks that share substantial data are scheduled onto the same streaming multiprocessor (SM), where they share its cache. Experiments on four benchmarks show an average speedup of 1.23X.
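The sketch below illustrates one plausible way such an on-the-fly remapping could be realized; it is not the paper's actual implementation. The assumptions are: the host has grouped data-sharing tasks into per-SM task lists (the names task_lists, task_cursors, and task_counts are hypothetical), and each block queries its SM id so that blocks resident on the same SM claim tasks from the same list and therefore reuse each other's cache lines.

```cuda
// Minimal sketch of software-level block-to-data remapping (assumed design,
// not the authors' code). Blocks on the same SM pull from the same task list.
#include <cuda_runtime.h>

__device__ unsigned smid() {
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));   // SM this block runs on
    return id;
}

__global__ void remapped_kernel(const int *task_lists,   // [num_sms][max_tasks], flattened
                                int *task_cursors,        // next unclaimed task per SM
                                const int *task_counts,   // number of tasks per SM
                                int max_tasks,
                                const float *data, float *out)
{
    __shared__ int task;                          // task slot claimed by this block
    unsigned sm = smid();
    if (threadIdx.x == 0)
        task = atomicAdd(&task_cursors[sm], 1);   // claim the next task for this SM
    __syncthreads();

    if (task >= task_counts[sm]) return;          // this SM's list is exhausted
    int logical_block = task_lists[sm * max_tasks + task];

    // Use logical_block in place of blockIdx.x, so data-sharing blocks
    // (grouped into the same per-SM list) hit the same cached lines.
    int i = logical_block * blockDim.x + threadIdx.x;
    out[i] = data[i] * 2.0f;                      // placeholder computation
}
```

In this scheme the launch configuration stays unchanged; only the mapping from hardware block IDs to data partitions is redirected through the per-SM lists, which is what lets data-sharing work land on a common SM cache.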
Index Terms
- Software-level scheduling to exploit non-uniformly shared data cache on GPGPU