Abstract
Deploying a Shared Last-Level Cache (SLLC) is an effective way to alleviate the memory bottleneck in modern throughput processors such as GPGPUs. A common scheduling policy in throughput processors is to exploit the maximum possible thread-level parallelism. However, this greedy policy often causes serious cache contention in the SLLC and significantly degrades system performance. A careful trade-off between thread-level parallelism and cache contention in thread scheduling is therefore critical to performance. This article characterizes and analyzes the performance impact of cache contention in the SLLC of throughput processors. Based on these analyses and findings, it formally formulates the aggregate working-set-size-constrained thread scheduling problem, which bounds the aggregate working-set size of concurrently executing threads. After proving the problem NP-hard, the article presents a series of algorithms that minimize cache contention and enhance overall system performance on GPGPUs. Simulation results on NVIDIA's Fermi architecture show that the proposed thread scheduling scheme improves execution time by up to 61.6% over a widely used thread clustering scheme. Compared to a state-of-the-art technique that exploits the data reuse of applications, the improvement in execution time reaches 47.4%. Notably, the proposed scheme's execution time is within 2.6% of that achieved by an exhaustive search.
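The working-set-size-constrained scheduling problem described above resembles bin packing: concurrent threads must be grouped so that each group's aggregate working-set size fits within the SLLC capacity. The sketch below is an illustrative first-fit-decreasing heuristic under that framing, not the paper's actual algorithm; the function name, the per-thread working-set sizes, and the `sllc_capacity` parameter are all hypothetical.

```python
def schedule_thread_groups(working_set_sizes, sllc_capacity):
    """Greedily partition threads into concurrent batches whose aggregate
    working-set size stays within the SLLC capacity (first-fit decreasing).

    Illustrative sketch only; the paper's algorithms are more involved.
    """
    # Consider threads in order of decreasing working-set size.
    order = sorted(range(len(working_set_sizes)),
                   key=lambda t: working_set_sizes[t], reverse=True)
    batches = []  # each entry: [remaining_capacity, [thread ids]]
    for t in order:
        size = working_set_sizes[t]
        for batch in batches:
            if batch[0] >= size:
                # Thread fits into an existing batch without exceeding capacity.
                batch[0] -= size
                batch[1].append(t)
                break
        else:
            # No existing batch has room; open a new concurrent batch.
            batches.append([sllc_capacity - size, [t]])
    return [members for _, members in batches]

# Example: four threads with working sets of 48, 32, 24, and 8 KB
# packed against a 64 KB SLLC budget.
print(schedule_thread_groups([48, 32, 24, 8], 64))  # → [[0, 3], [1, 2]]
```

Threads 0 and 3 (48 + 8 = 56 KB) share one batch, and threads 1 and 2 (32 + 24 = 56 KB) share another, so neither concurrent group exceeds the 64 KB budget; running all four together (112 KB) would thrash the cache.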
Index Terms
- Reducing Contention in Shared Last-Level Cache for Throughput Processors