research-article

Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources

Authors:

Jagadish B. Kotra,

Michael LeBeane,

Mahmut T. Kandemir,

Gabriel H. LohAuthors Info & Claims

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 1169 - 1181

https://doi.org/10.1145/3466752.3480105

Published: 17 October 2021 Publication History

Abstract

Many GPU applications issue irregular memory accesses to a very large memory footprint. We confirm observations from prior work that these irregular access patterns are severely bottlenecked by insufficient Translation Lookaside Buffer (TLB) reach, resulting in expensive page table walks. In this work, we investigate mechanisms to improve TLB reach without increasing the page size or the size of the TLB itself. Our work is based around the observation that a GPU’s instruction cache (I-cache) and Local Data Share (LDS) scratchpad memory are under-utilized in many applications, including those that suffer from poor TLB reach. We leverage this to opportunistically utilize idle capacity and port bandwidth from the GPU’s I-cache and LDS structures for address translations. We explore various potential architectural designs for each structure to optimize performance and minimize complexity. Both structures are organized as a victim cache between the L1 and L2 TLBs to boost translation reach. We find that our designs can increase performance on average by 30.1% without impacting the performance of applications that do not require additional reach.

References

[1]

Sriram Aananthakrishnan, Nesreen K. Ahmed, Vincent Cavé, Marcelo Cintra, Yigit Demir, Kristof Du Bois, Stijn Eyerman, Joshua B. Fryman, Ivan Ganev, Wim Heirman, Hans-Christian Hoppe, Jason Howard, Ibrahim Hur, Midhunchandra Kodiyath, Samkit Jain, Daniel S. Klowden, Marek M. Landowski, Laurent Montigny, Ankit More, Przemyslaw Ossowski, Robert Pawlowski, Nick Pepperling, Fabrizio Petrini, Mariusz Sikora, Balasubramanian Seshasayee, Shaden Smith, Sebastian Szkoda, Sanjaya Tayal, Jesmin Jahan Tithi, Yves Vandriessche, and Izajasz P. Wrosz. 2020. PIUMA: Programmable Integrated Unified Memory Architecture. CoRR abs/2010.06277(2020). arxiv:2010.06277https://arxiv.org/abs/2010.06277

[2]

AMD. 2019. AMD Graphics Cores Next (GCN) Architecture.

[3]

AMD. 2019. GPU Programming Guide. https://developer.amd.com/wordpress/media/2013/10/si_programming_guide_v2.pdf.

[4]

AMD. 2019. RadeonTM RX580. https://www.amd.com/en/products/graphics/radeon-rx-580.

[5]

AMD. 2019. ROC Profiler Library. https://github.com/ROCm-Developer-Tools/rocprofiler.

[6]

AMD. 2020. AMD MI-50 Accelerator. https://www.amd.com/system/files/documents/radeon-instinct-mi50-datasheet.pdf

[7]

AMD. 2020. AMD MI-60 Accelerator. https://www.amd.com/system/files/documents/radeon-instinct-mi60-datasheet.pdf.

[8]

R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, and O. Mutlu. 2017. Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, and Onur Mutlu. 2018. MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

Digital Library

[10]

Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don’T Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA).

Digital Library

[11]

Centor For Efficient Exascale Discretizations (CEED). 2019. Laghos: High-order Langrangian Hydrodynamics Miniapp. https://github.com/CEED/Laghos.

[12]

S Chan. 2019. A Brief Intro to the Heterogeneous Compute Compiler. https://gpuopen.com/a-brief-intro-to-boltzmann-hcc/.

[13]

K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens. 2019. DRAMPower: Open-source DRAM Power and Energy Estimation Tool. www.es.ele.tue.nl/drampower/.

[14]

S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. 2013. Pannotia: Understanding irregular GPGPU graph applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC).

[15]

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC).

[16]

CPU-World. 2019. AMD A12-Series PRO A12-8870. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A12-Series%20PRO%20A12-8870.html.

[17]

Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. [n. d.]. The Scalable Heterogeneous Computing (SHOC) Benchmark Suite.

[18]

Pat Flick. 2019. MIOpen Benchmarks. https://github.com/patflick/miopen-benchmark.

[19]

M. Gebhart, S. W. Keckler, B. Khailany, R. Krashinsky, and W. J. Dally. 2012. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

Digital Library

[20]

gem5. 2019. The gem5 simulator. http://gem5.org.

[21]

K. Group. 2019. OpenCL. https://www.khronos.org/opengl/.

[22]

GUPS. 2019. GUPS. https://icl.utk.edu/projectsfiles/hpcc/RandomAccess//.

[23]

Aamer Jaleel, Eiman Ebrahimi, and Sam Duncan. 2019. DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems. ACM Transactions on Architecture and Code Optimization (TACO) (2019).

Digital Library

[24]

Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. CoRR abs/1804.06826(2018).

[25]

Adwait Jog, Evgeny Bolotin, Zvika Guz, Mike Parker, Stephen W. Keckler, Mahmut T. Kandemir, and Chita R. Das. 2014. Application-Aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications. In Proceedings of Workshop on General Purpose Processing Using GPUs (GPGPU).

[26]

Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI).

Digital Library

[27]

L.-N. Pouchet and T. Yuki. 2019. Polybench Suite. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/.

[28]

Lawrence Livermore National Laboratory. 2019. Quicksilver. https://github.com/LLNL/Quicksilver.

[29]

Los Alamos National Laboratory. 2019. KRIPKE. https://github.com/LLNL/Kripke.

[30]

Los Alamos National Laboratory. 2019. The PENNANT Mini-App. https://github.com/lanl/PENNANT.

[31]

X. Mei and X. Chu. 2017. Dissecting GPU Memory Hierarchy Through Microbenchmarking. IEEE Transactions on Parallel and Distributed Systems 28, 1 (2017), 72–86.

Digital Library

[32]

H. Naghibijouybari, K. N. Khasawneh, and N. Abu-Ghazaleh. 2017. Constructing and Characterizing Covert Channels on GPGPUs. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33]

Nvidia. 2019. GPU Applications. https://developer.nvidia.com/nccl.

[34]

Nvidia. 2019. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.

[35]

Nvidia. 2019. Shared Memory. https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#shared-memory.

[36]

Mark Oskin and Gabriel H. Loh. 2015. A Software-Managed Approach to Die-Stacked DRAM. In 2015 International Conference on Parallel Architecture and Compilation (PACT).

Digital Library

[37]

Bharath Pichai, Lisa Hsu, and Abhishek Bhattacharjee. 2014. Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

Digital Library

[38]

J. Power, M. D. Hill, and D. A. Wood. 2014. Supporting x86-64 address translation for 100s of GPU lanes. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 568–578. https://doi.org/10.1109/HPCA.2014.6835965

[39]

Sooraj Puthoor and Mikko H. Lipasti. 2018. Compiler Assisted Coalescing. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT). Association for Computing Machinery.

Digital Library

[40]

Sooraj Puthoor, Xulong Tang, Joseph Gross, and Bradford M. Beckmann. 2018. Oversubscribed Command Queues in GPUs. In Proceedings of the 11th Workshop on General Purpose GPUs (GPGPU-11.

Digital Library

[41]

M. K. Qureshi and G. H. Loh. 2012. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

Digital Library

[42]

Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA).

Digital Library

[43]

S. Shin, G. Cox, M. Oskin, G. H. Loh, Y. Solihin, A. Bhattacharjee, and A. Basu. 2018. Scheduling Page Table Walks for Irregular GPU Applications. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[44]

Seunghee Shin, Michael LeBeane, Yan Solihin, and Arkaprava Basu. 2018. Neighborhood-aware Address Translation for Irregular GPU Applications. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

Digital Library

[45]

SR-IOV. [n. d.]. SR-IOV GPUs. https://www.amd.com/en/graphics/workstation-virtual-graphics. [Online; accessed February-02, 2021].

[46]

Xulong Tang, Ziyu Zhang, Weizheng Xu, Mahmut Taylan Kandemir, Rami Melhem, and Jun Yang. 2020. Enhancing Address Translations in Throughput Processors via Compression. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT).

Digital Library

[47]

J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee. 2016. Observations and opportunities in architecting shared virtual memory for heterogeneous systems. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[48]

Hongil Yoon, Jason Lowe-Power, and Gurindar S. Sohi. 2018. Filtering Translation Bandwidth with Virtual Caching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

Digital Library

[49]

Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, and Stephen W. Keckler. 2016. Towards high performance paged memory for GPUs. In Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA).

Cited By

Dalmia PShashi Kumar RSinclair M(2024)CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00058(700-717)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00058
Park JKwon OLee YKim SByeon GYoon JNair PHong S(2024)A Case for Speculative Address Translation with Rapid Validation for GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00029(278-292)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00029
Huang WDu YLiu M(2023)GPU Performance Acceleration via Intra-Group Sharing TLBProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605593(705-714)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3605573.3605593
Show More Cited By

Recommendations

Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture

Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) ...
Filtering Translation Bandwidth with Virtual Caching
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory ...
Filtering Translation Bandwidth with Virtual Caching
ASPLOS '18

Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN:9781450385572

DOI:10.1145/3466752

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '21

Sponsor:

SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
695
Total Downloads

Downloads (Last 12 months)94
Downloads (Last 6 weeks)11

Reflects downloads up to 20 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Dalmia PShashi Kumar RSinclair M(2024)CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00058(700-717)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00058
Park JKwon OLee YKim SByeon GYoon JNair PHong S(2024)A Case for Speculative Address Translation with Rapid Validation for GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00029(278-292)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00029
Huang WDu YLiu M(2023)GPU Performance Acceleration via Intra-Group Sharing TLBProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605593(705-714)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3605573.3605593
Li RYadav SWu QKavi KMehta GYadwadkar NJohn L(2023)Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes2023 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC59245.2023.00024(115-127)Online publication date: 1-Oct-2023
https://doi.org/10.1109/IISWC59245.2023.00024
Lee JLee JOh YSong WRo W(2023)SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071063(1195-1207)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10071063
Du YLiu MYang YZhang MTang X(2022)Enhancing GPU Performance via Neighboring Directory Table Based Inter-TLB Sharing2022 IEEE 40th International Conference on Computer Design (ICCD)10.1109/ICCD56317.2022.00031(146-153)Online publication date: Oct-2022
https://doi.org/10.1109/ICCD56317.2022.00031

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Figures

Tables

Media

View Table of Conten