skip to main content
10.1145/3466752.3480105acmconferencesArticle/Chapter ViewAbstractPublication PagesmicroConference Proceedingsconference-collections
research-article

Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources

Published: 17 October 2021 Publication History

Abstract

Many GPU applications issue irregular memory accesses to a very large memory footprint. We confirm observations from prior work that these irregular access patterns are severely bottlenecked by insufficient Translation Lookaside Buffer (TLB) reach, resulting in expensive page table walks. In this work, we investigate mechanisms to improve TLB reach without increasing the page size or the size of the TLB itself. Our work is based around the observation that a GPU’s instruction cache (I-cache) and Local Data Share (LDS) scratchpad memory are under-utilized in many applications, including those that suffer from poor TLB reach. We leverage this to opportunistically utilize idle capacity and port bandwidth from the GPU’s I-cache and LDS structures for address translations. We explore various potential architectural designs for each structure to optimize performance and minimize complexity. Both structures are organized as a victim cache between the L1 and L2 TLBs to boost translation reach. We find that our designs can increase performance on average by 30.1% without impacting the performance of applications that do not require additional reach.

References

[1]
Sriram Aananthakrishnan, Nesreen K. Ahmed, Vincent Cavé, Marcelo Cintra, Yigit Demir, Kristof Du Bois, Stijn Eyerman, Joshua B. Fryman, Ivan Ganev, Wim Heirman, Hans-Christian Hoppe, Jason Howard, Ibrahim Hur, Midhunchandra Kodiyath, Samkit Jain, Daniel S. Klowden, Marek M. Landowski, Laurent Montigny, Ankit More, Przemyslaw Ossowski, Robert Pawlowski, Nick Pepperling, Fabrizio Petrini, Mariusz Sikora, Balasubramanian Seshasayee, Shaden Smith, Sebastian Szkoda, Sanjaya Tayal, Jesmin Jahan Tithi, Yves Vandriessche, and Izajasz P. Wrosz. 2020. PIUMA: Programmable Integrated Unified Memory Architecture. CoRR abs/2010.06277(2020). arxiv:2010.06277https://arxiv.org/abs/2010.06277
[2]
AMD. 2019. AMD Graphics Cores Next (GCN) Architecture.
[3]
AMD. 2019. GPU Programming Guide. https://developer.amd.com/wordpress/media/2013/10/si_programming_guide_v2.pdf.
[4]
AMD. 2019. RadeonTM RX580. https://www.amd.com/en/products/graphics/radeon-rx-580.
[5]
AMD. 2019. ROC Profiler Library. https://github.com/ROCm-Developer-Tools/rocprofiler.
[6]
AMD. 2020. AMD MI-50 Accelerator. https://www.amd.com/system/files/documents/radeon-instinct-mi50-datasheet.pdf
[7]
AMD. 2020. AMD MI-60 Accelerator. https://www.amd.com/system/files/documents/radeon-instinct-mi60-datasheet.pdf.
[8]
R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, and O. Mutlu. 2017. Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9]
Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, and Onur Mutlu. 2018. MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[10]
Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don’T Walk (the Page Table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA).
[11]
Centor For Efficient Exascale Discretizations (CEED). 2019. Laghos: High-order Langrangian Hydrodynamics Miniapp. https://github.com/CEED/Laghos.
[12]
S Chan. 2019. A Brief Intro to the Heterogeneous Compute Compiler. https://gpuopen.com/a-brief-intro-to-boltzmann-hcc/.
[13]
K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens. 2019. DRAMPower: Open-source DRAM Power and Energy Estimation Tool. www.es.ele.tue.nl/drampower/.
[14]
S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. 2013. Pannotia: Understanding irregular GPGPU graph applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC).
[15]
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC).
[16]
CPU-World. 2019. AMD A12-Series PRO A12-8870. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A12-Series%20PRO%20A12-8870.html.
[17]
Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. [n. d.]. The Scalable Heterogeneous Computing (SHOC) Benchmark Suite.
[18]
Pat Flick. 2019. MIOpen Benchmarks. https://github.com/patflick/miopen-benchmark.
[19]
M. Gebhart, S. W. Keckler, B. Khailany, R. Krashinsky, and W. J. Dally. 2012. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[20]
gem5. 2019. The gem5 simulator. http://gem5.org.
[21]
K. Group. 2019. OpenCL. https://www.khronos.org/opengl/.
[22]
GUPS. 2019. GUPS. https://icl.utk.edu/projectsfiles/hpcc/RandomAccess//.
[23]
Aamer Jaleel, Eiman Ebrahimi, and Sam Duncan. 2019. DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems. ACM Transactions on Architecture and Code Optimization (TACO) (2019).
[24]
Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. CoRR abs/1804.06826(2018).
[25]
Adwait Jog, Evgeny Bolotin, Zvika Guz, Mike Parker, Stephen W. Keckler, Mahmut T. Kandemir, and Chita R. Das. 2014. Application-Aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications. In Proceedings of Workshop on General Purpose Processing Using GPUs (GPGPU).
[26]
Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI).
[27]
L.-N. Pouchet and T. Yuki. 2019. Polybench Suite. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/.
[28]
Lawrence Livermore National Laboratory. 2019. Quicksilver. https://github.com/LLNL/Quicksilver.
[29]
Los Alamos National Laboratory. 2019. KRIPKE. https://github.com/LLNL/Kripke.
[30]
Los Alamos National Laboratory. 2019. The PENNANT Mini-App. https://github.com/lanl/PENNANT.
[31]
X. Mei and X. Chu. 2017. Dissecting GPU Memory Hierarchy Through Microbenchmarking. IEEE Transactions on Parallel and Distributed Systems 28, 1 (2017), 72–86.
[32]
H. Naghibijouybari, K. N. Khasawneh, and N. Abu-Ghazaleh. 2017. Constructing and Characterizing Covert Channels on GPGPUs. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[33]
Nvidia. 2019. GPU Applications. https://developer.nvidia.com/nccl.
[34]
Nvidia. 2019. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.
[35]
Nvidia. 2019. Shared Memory. https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#shared-memory.
[36]
Mark Oskin and Gabriel H. Loh. 2015. A Software-Managed Approach to Die-Stacked DRAM. In 2015 International Conference on Parallel Architecture and Compilation (PACT).
[37]
Bharath Pichai, Lisa Hsu, and Abhishek Bhattacharjee. 2014. Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[38]
J. Power, M. D. Hill, and D. A. Wood. 2014. Supporting x86-64 address translation for 100s of GPU lanes. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 568–578. https://doi.org/10.1109/HPCA.2014.6835965
[39]
Sooraj Puthoor and Mikko H. Lipasti. 2018. Compiler Assisted Coalescing. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT). Association for Computing Machinery.
[40]
Sooraj Puthoor, Xulong Tang, Joseph Gross, and Bradford M. Beckmann. 2018. Oversubscribed Command Queues in GPUs. In Proceedings of the 11th Workshop on General Purpose GPUs (GPGPU-11.
[41]
M. K. Qureshi and G. H. Loh. 2012. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[42]
Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA).
[43]
S. Shin, G. Cox, M. Oskin, G. H. Loh, Y. Solihin, A. Bhattacharjee, and A. Basu. 2018. Scheduling Page Table Walks for Irregular GPU Applications. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[44]
Seunghee Shin, Michael LeBeane, Yan Solihin, and Arkaprava Basu. 2018. Neighborhood-aware Address Translation for Irregular GPU Applications. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[45]
SR-IOV. [n. d.]. SR-IOV GPUs. https://www.amd.com/en/graphics/workstation-virtual-graphics. [Online; accessed February-02, 2021].
[46]
Xulong Tang, Ziyu Zhang, Weizheng Xu, Mahmut Taylan Kandemir, Rami Melhem, and Jun Yang. 2020. Enhancing Address Translations in Throughput Processors via Compression. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT).
[47]
J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee. 2016. Observations and opportunities in architecting shared virtual memory for heterogeneous systems. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[48]
Hongil Yoon, Jason Lowe-Power, and Gurindar S. Sohi. 2018. Filtering Translation Bandwidth with Virtual Caching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[49]
Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, and Stephen W. Keckler. 2016. Towards high performance paged memory for GPUs. In Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA).

Cited By

View all
  • (2024)CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00058(700-717)Online publication date: 2-Nov-2024
  • (2024)A Case for Speculative Address Translation with Rapid Validation for GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00029(278-292)Online publication date: 2-Nov-2024
  • (2023)GPU Performance Acceleration via Intra-Group Sharing TLBProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605593(705-714)Online publication date: 7-Aug-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN:9781450385572
DOI:10.1145/3466752
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. CPU+GPU Systems
  2. Irregular Applications
  3. Reconfigurable Systems
  4. Virtual Memory

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)94
  • Downloads (Last 6 weeks)11
Reflects downloads up to 20 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00058(700-717)Online publication date: 2-Nov-2024
  • (2024)A Case for Speculative Address Translation with Rapid Validation for GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00029(278-292)Online publication date: 2-Nov-2024
  • (2023)GPU Performance Acceleration via Intra-Group Sharing TLBProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605593(705-714)Online publication date: 7-Aug-2023
  • (2023)Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes2023 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC59245.2023.00024(115-127)Online publication date: 1-Oct-2023
  • (2023)SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071063(1195-1207)Online publication date: Feb-2023
  • (2022)Enhancing GPU Performance via Neighboring Directory Table Based Inter-TLB Sharing2022 IEEE 40th International Conference on Computer Design (ICCD)10.1109/ICCD56317.2022.00031(146-153)Online publication date: Oct-2022

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media