ABSTRACT
Recent GPU architectures support unified virtual memory (UVM), which makes it possible to solve larger problems through memory oversubscription. Although several studies have raised concerns about performance degradation under UVM oversubscription, the reasons behind workloads' diverse sensitivities to oversubscription remain unclear. In this work, we take the first step: we select a variety of benchmark applications and conduct rigorous experiments on their performance under different oversubscription ratios. Specifically, we take into account the variety of memory access patterns and explain applications' diverse sensitivities to oversubscription. We also consider prefetching and UVM hints, and uncover their complex impact under different oversubscription ratios. Moreover, we discuss the strengths and pitfalls of UVM's multi-GPU support. We expect this paper to provide useful experience and insights for UVM system design.
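For context, the prefetching and hint mechanisms studied here are exposed through the CUDA runtime API. The following minimal sketch (array size, device ID, and kernel are illustrative assumptions, not the paper's benchmark code) shows how a UVM allocation can be advised and prefetched around a kernel launch; increasing `n` beyond device memory capacity is what triggers the oversubscription behavior the paper measures:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 26;               // illustrative size; raise it past
    const size_t bytes = n * sizeof(float); // device capacity to oversubscribe
    int dev = 0;                            // illustrative: first GPU
    float *data;

    cudaMallocManaged(&data, bytes);        // UVM allocation, migrated on demand
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // first touch on the CPU

    // UVM hint: pin the region's preferred location to the GPU so the
    // driver avoids evicting it back to host memory eagerly.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, dev);

    // Explicit prefetch migrates pages in bulk before the kernel runs,
    // instead of paying per-page fault costs inside the kernel.
    cudaMemPrefetchAsync(data, bytes, dev);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);

    // Prefetch back to the CPU before host-side access.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

With hints and prefetches removed, the same program still runs correctly under UVM's demand paging, which is exactly why the two mechanisms can be toggled independently when studying sensitivity to oversubscription.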
Index Terms
- Oversubscribing GPU Unified Virtual Memory: Implications and Suggestions