Reducing Contention in Shared Last-Level Cache for Throughput Processors

Published: 18 November 2014

Abstract

Deploying a Shared Last-Level Cache (SLLC) is an effective way to alleviate the memory bottleneck in modern throughput processors such as GPGPUs. A common scheduling policy for throughput processors is to expose the maximum possible thread-level parallelism. However, this greedy policy often causes serious cache contention in the SLLC and significantly degrades system performance. It is therefore critical that the thread scheduler of a throughput processor carefully trade off thread-level parallelism against cache contention. This article characterizes and analyzes the performance impact of cache contention in the SLLC of throughput processors. Based on these analyses and findings, it formally formulates the aggregate working-set-size-constrained thread scheduling problem, which bounds the aggregate working-set size of concurrently executing threads. After proving the problem NP-hard, the article presents a series of algorithms to minimize cache contention and enhance overall system performance on GPGPUs. Simulation results on NVIDIA's Fermi architecture show that the proposed thread scheduling scheme improves execution time by up to 61.6% over a widely used thread clustering scheme, and by up to 47.4% over a state-of-the-art technique that exploits the data reuse of applications. Notably, the proposed scheme comes within 2.6% of the execution time achieved by an exhaustive search.
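To make the constraint concrete: the formulated problem asks for groups of threads whose combined working sets fit within the SLLC, which is closely related to bin packing. The sketch below is not the paper's algorithm; it is a minimal first-fit-decreasing heuristic, with hypothetical thread-block IDs and working-set sizes, illustrating how concurrent batches might be formed so that each batch's aggregate working-set size stays within an assumed SLLC capacity.

```python
# Illustrative sketch (not the authors' algorithm): group thread blocks into
# batches so that each batch's aggregate working-set size fits in the SLLC.
# The constrained scheduling problem is NP-hard, so a first-fit-decreasing
# heuristic (as in bin packing) is a natural baseline.

def schedule_batches(working_sets, sllc_capacity):
    """Group thread blocks into batches whose total working set fits the SLLC.

    working_sets: dict mapping thread-block id -> working-set size.
    Returns a list of batches (lists of block ids) to run one after another.
    """
    batches = []  # each entry: [current total size, [block ids]]
    # Place large working sets first (first-fit decreasing).
    for block, size in sorted(working_sets.items(), key=lambda kv: -kv[1]):
        for batch in batches:
            if batch[0] + size <= sllc_capacity:
                batch[0] += size
                batch[1].append(block)
                break
        else:
            batches.append([size, [block]])
    return [blocks for _, blocks in batches]

# Hypothetical per-block working sets in KB and a 768 KB SLLC.
ws = {"B0": 300, "B1": 500, "B2": 200, "B3": 400, "B4": 100}
print(schedule_batches(ws, sllc_capacity=768))
```

Running the batches sequentially keeps every concurrently resident working set within cache capacity, trading some thread-level parallelism for reduced SLLC contention.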



Published in

ACM Transactions on Design Automation of Electronic Systems, Volume 20, Issue 1
November 2014, 377 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/2690851

        Copyright © 2014 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

Publication History

• Received: 1 September 2013
• Revised: 1 May 2014
• Accepted: 1 June 2014
• Published: 18 November 2014


Qualifiers

• Refereed research article
