research-article

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

Authors:
Bo-Wun Cheng

National Tsing Hua University, Hsinchu, Taiwan

National Tsing Hua University, Hsinchu, Taiwan
View Profile

,
En-Ming Huang

National Tsing Hua University, Hsinchu, Taiwan

National Tsing Hua University, Hsinchu, Taiwan
View Profile

,
Chen-Hao Chao

National Tsing Hua University, Hsinchu, Taiwan

National Tsing Hua University, Hsinchu, Taiwan
View Profile

,
Wei-Fang Sun

National Tsing Hua University, Hsinchu, Taiwan

National Tsing Hua University, Hsinchu, Taiwan
View Profile

,
Tsung-Tai Yeh

National Yang Ming Chiao Tung University, Hsinchu, Taiwan

National Yang Ming Chiao Tung University, Hsinchu, Taiwan
View Profile

,
Chun-Yi Lee

National Tsing Hua University, Hsinchu, Taiwan

National Tsing Hua University, Hsinchu, Taiwan
View Profile

ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation ConferenceJanuary 2023Pages 314–319https://doi.org/10.1145/3566097.3567838

Published:31 January 2023Publication History

ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

Pages 314–319

ABSTRACT

In this work, we aim to capture replicated cache requests between Stream Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing replicated cache requests within SM clusters that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnection can be eased. Our experimental results demonstrate that the adoption of COLAB can indeed alleviate the excessive NoC pressure caused by replicated cache requests, and improve the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB can reduce 38% of the NoC traffic and improve instructions per cycle (IPC) by 43%.

References

Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44--54.Google ScholarDigital Library
Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. 2014. Adaptive Cache Management for Energy-Efficient GPU Computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (Cambridge, United Kingdom) (MICRO-47). IEEE Computer Society, USA, 343--355.Google ScholarDigital Library
Bo-Wun Cheng, En-Ming Haung, Chen-Hao Chao, Wei-Fang Sun, Tsung-Tai Yeh, and Chun-Yi Lee. 2022. Remote Access Tag Array for Efficient GPU Intra-Cluster Data Sharing. In Proceedings of the 24th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI '22). 221--222.Google Scholar
Kyoshin Choo, William Panlener, and Byunghyun Jang. 2014. Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads. In 2014 IEEE 13th International Symposium on Parallel and Distributed Computing. 189--196.Google Scholar
Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2016. Cooperative Caching for GPUs. ACM Trans. Archit. Code Optim. 13, 4, Article 39 (dec 2016), 25 pages.Google ScholarDigital Library
Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar). 1--10.Google Scholar
Mohamed Assem Ibrahim, Onur Kayiran, Yasuko Eckert, Gabriel H. Loh, and Adwait Jog. 2020. Analyzing and Leveraging Shared L1 Caches in GPUs. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (Virtual Event, GA, USA) (PACT '20). Association for Computing Machinery, New York, NY, USA, 161--173.Google ScholarDigital Library
Mohamed Assem Ibrahim, Onur Kayiran, Yasuko Eckert, Gabriel H. Loh, and Adwait Jog. 2021. Analyzing and Leveraging Decoupled L1 Caches in GPUs. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 467--478.Google ScholarCross Ref
Mohamed Assem Ibrahim, Hongyuan Liu, Onur Kayiran, and Adwait Jog. 2019. Analyzing and Leveraging Remote-Core Bandwidth for Enhanced Performance in GPUs. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 258--271.Google Scholar
Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (Houston, Texas, USA) (ASPLOS '13). Association for Computing Machinery, New York, NY, USA, 395--406.Google Scholar
Aajna Karki, Chethan Palangotu Keshava, Spoorthi Mysore Shivakumar, Joshua Skow, Goutam Madhukeshwar Hegde, and Hyeran Jeon. 2019. Detailed Characterization of Deep Neural Networks on GPUs and FPGAs. In Proceedings of the 12th Workshop on General Purpose Processing Using GPUs (Providence, RI, USA) (GPGPU '19). Association for Computing Machinery, New York, NY, USA, 12--21.Google ScholarDigital Library
Onur Kayiran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. 157--166.Google ScholarDigital Library
Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 473--486.Google Scholar
Gunjae Koo, Yunho Oh, Won Woo Ro, and Murali Annavaram. 2017. Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 307--319.Google ScholarDigital Library
Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (Tel-Aviv, Israel) (ISCA '13). Association for Computing Machinery, New York, NY, USA, 487--498.Google ScholarDigital Library
Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-Driven Dynamic GPU Cache Bypassing. In Proceedings of the 29th ACM on International Conference on Supercomputing (Newport Beach, California, USA) (ICS '15). Association for Computing Machinery, New York, NY, USA, 67--77.Google ScholarDigital Library
Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 3--14.Google ScholarDigital Library
NVIDIA. 2020. NVIDIA AMPERE GA102 GPU ARCHITECTURE. Retrieved July 27, 2022 from https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdfGoogle Scholar
Yunho Oh, Gunjae Koo, Murali Annavaram, and Won Woo Ro. 2019. Linebacker: Preserving Victim Cache Lines in Idle Register Files of GPUs. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA '19). Association for Computing Machinery, New York, NY, USA, 183--196.Google ScholarDigital Library
Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2012. Cache-Conscious Wavefront Scheduling. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. 72--83.Google Scholar
Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2013. Divergence-Aware Warp Scheduling. In 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 99--110.Google Scholar
David Tarjan and Kevin Skadron. 2010. The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). IEEE Computer Society, USA, 1--10.Google ScholarDigital Library
Jianfei Wang, Li Jiang, Jing Ke, Xiaoyao Liang, and Naifeng Jing. 2019. A Sharing-Aware L1.5D Cache for Data Reuse in GPGPUs. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (Tokyo, Japan) (ASPDAC '19). Association for Computing Machinery, New York, NY, USA, 388--393.Google ScholarDigital Library

Index Terms

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

Index terms have been assigned to the content through auto-classification.

Recommendations

TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and ...
Read More
SELECTIVE VICTIM CACHING: A METHOD TO IMPROVE THE PERFORMANCE OF DIRECT-MAPPED CACHES
Read More
Effective cache prefetching on bus-based multiprocessors

Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a shared-memory multiprocessor. Prefetching ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference
January 2023
807 pages
ISBN:9781450397834
DOI:10.1145/3566097
General Chair:
Atsushi Takahashi
Tokyo Institute of Technology
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 31 January 2023
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
ASPDAC '23 Paper Acceptance Rate102of328submissions,31%Overall Acceptance Rate466of1,454submissions,32%
More
Upcoming Conference
ASPDAC '25

Sponsor:

sigda

30th Asia and South Pacific Design Automation Conference

January 20 - 23, 2025

Tokyo , Japan
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 99
  Total Downloads
- Downloads (Last 12 months)65
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

SELECTIVE VICTIM CACHING: A METHOD TO IMPROVE THE PERFORMANCE OF DIRECT-MAPPED CACHES

Effective cache prefetching on bus-based multiprocessors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

SELECTIVE VICTIM CACHING: A METHOD TO IMPROVE THE PERFORMANCE OF DIRECT-MAPPED CACHES

Effective cache prefetching on bus-based multiprocessors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media