ABSTRACT
In this paper, we propose a runtime called HUM that hides host-to-device memory copy time without any code modification. It overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanisms. HUM provides wrapper functions for CUDA commands and executes host-to-device memory copy commands asynchronously. We also propose two runtime techniques. The first checks whether it is correct to make a synchronous host-to-device memory copy command asynchronous; if not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The second subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules the requests from different commands in a round-robin manner. As a result, kernel execution can be scheduled as early as possible to maximize the overlap. We evaluate HUM using 51 applications from Parboil, Rodinia, and the CUDA Code Samples, and compare their performance under HUM with that of hand-optimized implementations. The evaluation shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under the original CUDA runtime. This speedup is comparable to the average speedup of 1.22 achieved by the hand-optimized implementations for Unified Memory.
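The second technique described above can be sketched in plain C++. The sketch below subdivides each host-to-device copy command into fixed-size chunks and interleaves chunks from different commands round-robin, so that a kernel waiting only on the first command's data can be released as early as possible. The names (`CopyChunk`, `ScheduleRoundRobin`) and the 4 MiB chunk granularity are illustrative assumptions, not HUM's actual API or parameters.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// One piece of a subdivided host-to-device memory copy command.
struct CopyChunk {
    int command_id;     // which original memcpy command this chunk belongs to
    std::size_t offset; // byte offset within that command's buffer
    std::size_t bytes;  // chunk size (the last chunk may be smaller)
};

// Assumed subdivision granularity; HUM's real value may differ.
constexpr std::size_t kChunkBytes = 4 * 1024 * 1024;

// Split each command's total size into chunks, then emit the chunks
// round-robin across the pending commands.
std::vector<CopyChunk>
ScheduleRoundRobin(const std::vector<std::size_t>& cmd_sizes) {
    std::vector<std::queue<CopyChunk>> per_cmd(cmd_sizes.size());
    for (std::size_t i = 0; i < cmd_sizes.size(); ++i) {
        for (std::size_t off = 0; off < cmd_sizes[i]; off += kChunkBytes) {
            per_cmd[i].push({static_cast<int>(i), off,
                             std::min(kChunkBytes, cmd_sizes[i] - off)});
        }
    }
    std::vector<CopyChunk> order;
    bool any = true;
    while (any) {
        any = false;
        for (auto& q : per_cmd) {       // one chunk from each command per pass
            if (!q.empty()) {
                order.push_back(q.front());
                q.pop();
                any = true;
            }
        }
    }
    return order;
}
```

With two pending commands of 8 MiB and 12 MiB, the schedule interleaves their chunks (cmd 0, cmd 1, cmd 0, cmd 1, cmd 1) instead of draining the first command completely, which is what lets a kernel depending on either buffer start sooner.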