Abstract
Deploying a Shared Last-Level Cache (SLLC) is an effective way to alleviate the memory bottleneck in modern throughput processors such as GPGPUs. A common scheduling policy in throughput processors is to exploit the maximum possible thread-level parallelism. However, this greedy policy often causes serious cache contention in the SLLC and significantly degrades system performance. A careful trade-off between thread-level parallelism and cache contention in thread scheduling is therefore critical to performance. This article characterizes and analyzes the performance impact of cache contention in the SLLC of throughput processors. Based on these analyses and findings, it formally formulates the aggregate working-set-size-constrained thread scheduling problem, which bounds the aggregate working-set size of concurrently executing threads. After proving the problem NP-hard, the article presents a series of algorithms that minimize cache contention and enhance overall system performance on GPGPUs. Simulation results on NVIDIA's Fermi architecture show that the proposed thread scheduling scheme improves execution time by up to 61.6% over a widely used thread clustering scheme. Compared to a state-of-the-art technique that exploits the data reuse of applications, the improvement in execution time reaches 47.4%. Notably, the proposed scheme's execution time is within 2.6% of that achieved by an exhaustive search.
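The working-set-size-constrained scheduling problem described above resembles bin packing: concurrent threads must be grouped so that each group's aggregate working-set size fits within the SLLC capacity. The sketch below is an illustrative first-fit-decreasing heuristic under that framing, not the paper's actual algorithm; the function name, the per-thread working-set sizes, and the `sllc_capacity` parameter are all hypothetical.

```python
def schedule_thread_groups(working_set_sizes, sllc_capacity):
    """Greedily partition threads into concurrent batches whose aggregate
    working-set size stays within the SLLC capacity (first-fit decreasing).

    Illustrative sketch only; the paper's algorithms are more involved.
    """
    # Consider threads in order of decreasing working-set size.
    order = sorted(range(len(working_set_sizes)),
                   key=lambda t: working_set_sizes[t], reverse=True)
    batches = []  # each entry: [remaining_capacity, [thread ids]]
    for t in order:
        size = working_set_sizes[t]
        for batch in batches:
            if batch[0] >= size:
                # Thread fits into an existing batch without exceeding capacity.
                batch[0] -= size
                batch[1].append(t)
                break
        else:
            # No existing batch has room; open a new concurrent batch.
            batches.append([sllc_capacity - size, [t]])
    return [members for _, members in batches]

# Example: four threads with working sets of 48, 32, 24, and 8 KB
# packed against a 64 KB SLLC budget.
print(schedule_thread_groups([48, 32, 24, 8], 64))  # → [[0, 3], [1, 2]]
```

Threads 0 and 3 (48 + 8 = 56 KB) share one batch, and threads 1 and 2 (32 + 24 = 56 KB) share another, so neither concurrent group exceeds the 64 KB budget; running all four together (112 KB) would thrash the cache.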
Index Terms
- Reducing Contention in Shared Last-Level Cache for Throughput Processors