DOI: 10.1145/3307650.3322224

Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory

Published: 22 June 2019

Abstract

Memory capacity in GPGPUs is a major challenge for data-intensive applications with their ever-increasing memory requirements. To fit a workload into the limited GPU memory space, a programmer needs to manually divide the workload by tiling the working set and performing user-level data migration. To relieve the programmer of this burden, Unified Virtual Memory (UVM) was developed to support on-demand paging and migration, transparently to the user. It further handles memory oversubscription by automatically performing page replacement when GPU memory is oversubscribed. However, we found that naïve handling of page faults can cause orders-of-magnitude slowdowns in performance. Moreover, we observed that although prefetching data from the CPU to the GPU can hide page-fault latency, different prefetching mechanisms can lead to drastically different performance results. To this end, we performed extensive experiments on GeForce GTX 1080 Ti GPUs over PCIe 3.0 x16 and discovered that an effective prefetch mechanism exists that enhances locality in GPU memory. However, as GPU memory fills to capacity, such a prefetching mechanism quickly proves counterproductive because of the locality-unaware eviction policy. This necessitates the design of new eviction policies that are aware of the hardware prefetcher's semantics. We propose two new programmer-agnostic, locality-aware pre-eviction policies that leverage the mechanics of the existing hardware prefetcher and thus incur no additional implementation or performance overhead. We demonstrate that combining the proposed tree-based pre-eviction policy with the hardware prefetcher provides average speedups of 93% and 18.5% over LRU-based 4KB and 2MB page replacement strategies, respectively. We further examine the memory access patterns of the GPU workloads under consideration to analyze the achieved speedups.
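For readers unfamiliar with UVM, the sketch below is a minimal illustration (the kernel name `scale`, the allocation size, and the explicit prefetch hint are illustrative choices, not taken from the paper) of how CUDA Unified Memory is typically used: cudaMallocManaged allocates memory accessible from both CPU and GPU that is migrated to the GPU on demand at page-fault time, while cudaMemPrefetchAsync lets the programmer issue an explicit migration hint. The hardware prefetcher and pre-eviction policies studied in the paper operate underneath this API, inside the UVM runtime, when faults and oversubscription occur.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel; name and workload are not from the paper.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1ull << 28;   // 1 GiB of floats; size is illustrative
    float *data = nullptr;

    // Unified (managed) allocation: accessible from both CPU and GPU.
    // Pages migrate to the GPU on demand when the kernel first touches them.
    cudaMallocManaged(&data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // populated on the CPU

    int device = 0;
    cudaGetDevice(&device);

    // Optional hint: asynchronously prefetch the range to the GPU so the
    // kernel does not pay per-page fault latency. This is an explicit,
    // user-issued analogue of the driver/hardware prefetching the paper studies.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);   // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```

Without the explicit prefetch call, the first GPU access to each page triggers a far-fault serviced over PCIe, which is the latency the hardware prefetcher tries to hide; once the working set exceeds GPU memory, the UVM runtime must also evict previously migrated pages back to host memory, which is where the proposed pre-eviction policies come into play.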

Published In

ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019
849 pages
ISBN: 9781450366694
DOI: 10.1145/3307650

In-Cooperation

  • IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2019

Author Tags

  1. gpu
  2. hardware prefetcher
  3. page eviction policy
  4. unified virtual memory

Qualifiers

  • Research-article

Conference

ISCA '19

Acceptance Rates

ISCA '19 paper acceptance rate: 62 of 365 submissions (17%)
Overall acceptance rate: 543 of 3,203 submissions (17%)
