
Memory-aware TLP throttling and cache bypassing for GPUs


Abstract

General-purpose graphics processing units (GPGPUs) have become one of the most important high-performance platforms for high-throughput applications. However, because a GPGPU runs a large number of threads concurrently, contention for on-chip resources occurs frequently and has become an important factor limiting GPGPU performance. We propose the memory-aware thread-level parallelism (TLP) throttling and cache bypassing (MATB) mechanism, which exploits both temporal data locality and memory bandwidth. MATB aims to keep cache blocks with good data locality in the L1D cache longer while maintaining on-chip resource utilization. On one hand, it alleviates cache contention by preventing memory warps with poor data reuse from being scheduled when cache contention and on-chip network congestion occur. On the other hand, it utilizes memory bandwidth more effectively through cache bypassing. Experimental results show that MATB achieves average performance improvements of 26.6% and 14.2% over GTO and DYNCTA, respectively, at low hardware cost.
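To make the mechanism concrete, below is a minimal sketch of the two decisions the abstract describes: throttling low-reuse memory warps while cache contention and network congestion persist, and bypassing the L1D for streaming data. Everything here (WarpState, reuseCount, kReuseThreshold, kCongestionLimit, the decide function) is an illustrative assumption for exposition, not the paper's actual hardware design, which would live in the SM's warp scheduler and use empirically tuned thresholds.

```cpp
// Minimal sketch of MATB-style decision logic (illustrative only; field
// names, thresholds, and structure are assumptions, not the paper's design).
#include <cstdint>
#include <iostream>
#include <vector>

struct WarpState {
    uint32_t id;
    uint32_t reuseCount;  // L1D hits on this warp's blocks: a locality proxy
};

enum class MemAction { AccessL1D, BypassL1D, Throttle };

// Hypothetical thresholds; a real design would tune these empirically.
constexpr uint32_t kReuseThreshold  = 2;    // below this, data reuse is "bad"
constexpr double   kCongestionLimit = 0.8;  // NoC utilization signalling congestion

// Decide, per memory warp, whether to issue normally, bypass the L1D,
// or hold the warp back while contention and congestion persist.
MemAction decide(const WarpState& w, double nocUtilization, bool cacheContended) {
    const bool badReuse = w.reuseCount < kReuseThreshold;
    if (badReuse && cacheContended && nocUtilization > kCongestionLimit)
        return MemAction::Throttle;   // TLP throttling: defer low-reuse warps
    if (badReuse)
        return MemAction::BypassL1D;  // streaming data: spend bandwidth, spare the cache
    return MemAction::AccessL1D;      // good locality: keep its blocks in L1D longer
}

int main() {
    // Three warps with good, bad, and marginal reuse histories.
    std::vector<WarpState> warps = {{0, 5}, {1, 0}, {2, 1}};
    const double nocUtil = 0.9;   // assume the interconnect is congested
    const bool contended = true;  // assume L1D contention was detected
    for (const auto& w : warps) {
        const MemAction a = decide(w, nocUtil, contended);
        std::cout << "warp " << w.id << " -> "
                  << (a == MemAction::Throttle  ? "throttle"
                    : a == MemAction::BypassL1D ? "bypass L1D"
                                                : "access L1D")
                  << '\n';
    }
    return 0;
}
```

Under these assumed thresholds, a warp that keeps hitting its own L1D blocks issues normally, while a low-reuse warp is either deferred during congestion or routed around the cache so its blocks never evict data that other warps will reuse.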





Acknowledgements

The authors would like to thank the reviewers for their valuable suggestions, which helped improve this work greatly. This work was partially supported by the National Natural Science Foundation of China [Project Nos. 61373039, 61662002, 61462004], the Natural Science Foundation of Jiangxi Province, China [Project Nos. 20151BAB207042, 20161BAB212056], the Key Research and Development Plan of the Scientific Department in Jiangxi Province, China [No. 20161BBE50063], and the Science and Technology Project of the Education Department in Jiangxi Province, China [Project No. GJJ150605]. Yanxiang He is the corresponding author.

Author information

Correspondence to Jun Zhang.


About this article


Cite this article

Zhang, J., He, Y., Shen, F. et al. Memory-aware TLP throttling and cache bypassing for GPUs. Cluster Comput 22 (Suppl 1), 871–883 (2019). https://doi.org/10.1007/s10586-017-1396-0

