Artifacts Evaluated & Functional

Overlapping host-to-device copy and computation using hidden unified memory

Published: 19 February 2020

ABSTRACT

In this paper, we propose a runtime called HUM that hides host-to-device memory copy time without requiring any code modification. HUM overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. It provides wrapper functions for CUDA commands and executes host-to-device memory copy commands asynchronously. We also propose two runtime techniques. The first checks whether it is correct to make a synchronous host-to-device memory copy command asynchronous; if not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The second subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules requests from different commands in a round-robin manner, so that kernel execution can be scheduled as early as possible and the overlap is maximized. We evaluate HUM using 51 applications from Parboil, Rodinia, and the CUDA Code Samples, and compare their performance under HUM with that of hand-optimized implementations. The evaluation shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under the original CUDA runtime. This speedup is comparable to the average speedup of 1.22 achieved by the hand-optimized Unified Memory implementations.
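For context, the hand-optimized overlap that HUM aims to automate looks roughly like the following CUDA sketch: the input is copied in chunks with cudaMemcpyAsync on separate streams, so that each chunk's kernel launch overlaps the next chunk's host-to-device transfer. This is a minimal illustration, not code from the paper; the buffer names, the `scale` kernel, and the chunk count are hypothetical.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the application's computation.
__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t N = 1 << 24;      // 16M floats (illustrative size)
    const int NCHUNKS = 4;         // subdivision factor (illustrative)
    const size_t CHUNK = N / NCHUNKS;

    float *h_buf, *d_buf;
    // Pinned host memory is required for cudaMemcpyAsync to actually
    // overlap with kernel execution.
    cudaMallocHost((void **)&h_buf, N * sizeof(float));
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    for (size_t i = 0; i < N; ++i) h_buf[i] = 1.0f;

    cudaStream_t streams[NCHUNKS];
    for (int s = 0; s < NCHUNKS; ++s) cudaStreamCreate(&streams[s]);

    // Copy chunk s and launch its kernel in stream s: while chunk s
    // computes, chunk s+1's host-to-device transfer is in flight.
    for (int s = 0; s < NCHUNKS; ++s) {
        size_t off = (size_t)s * CHUNK;
        cudaMemcpyAsync(d_buf + off, h_buf + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(unsigned)((CHUNK + 255) / 256), 256, 0, streams[s]>>>(
            d_buf + off, CHUNK);
    }
    cudaDeviceSynchronize();

    // Spot-check one element.
    float result;
    cudaMemcpy(&result, d_buf, sizeof(float), cudaMemcpyDeviceToHost);
    printf("d_buf[0] = %f (expected 2.0)\n", result);

    for (int s = 0; s < NCHUNKS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}

Writing this pipelining by hand is exactly the per-application effort that the paper's round-robin subdivision of copy commands performs automatically.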

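The wrapper-function mechanism can be pictured as library interposition. The sketch below is a loose, hypothetical illustration and not HUM's implementation: it shadows cudaMemcpy via an LD_PRELOAD shared library, turns synchronous host-to-device copies into asynchronous ones on a private stream, and conservatively drains that stream before any kernel launch. HUM's actual correctness check, which decides when this transformation is safe and otherwise stalls the dependent computation, relies on Unified Memory and fault mechanisms that are not reproduced here.

// hum_sketch.cpp -- hypothetical interposition sketch (not HUM itself).
// Build: g++ -shared -fPIC hum_sketch.cpp -o libhum.so -ldl -lcudart
// Run:   LD_PRELOAD=./libhum.so ./app   (app must link the shared runtime)
#include <dlfcn.h>
#include <cuda_runtime.h>

// One private stream carries the deferred host-to-device copies.
static cudaStream_t hum_stream() {
    static cudaStream_t s = [] {
        cudaStream_t t;
        cudaStreamCreate(&t);
        return t;
    }();
    return s;
}

extern "C" cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                                  cudaMemcpyKind kind) {
    using real_fn = cudaError_t (*)(void *, const void *, size_t,
                                    cudaMemcpyKind);
    static real_fn real = (real_fn)dlsym(RTLD_NEXT, "cudaMemcpy");

    if (kind == cudaMemcpyHostToDevice) {
        // Return immediately so host computation can proceed. The real HUM
        // uses Unified Memory and fault handling to guarantee correctness;
        // this sketch simply assumes the host does not modify `src` before
        // the deferred copy drains.
        return cudaMemcpyAsync(dst, src, count, kind, hum_stream());
    }
    return real(dst, src, count, kind);
}

extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 grid,
                                        dim3 block, void **args,
                                        size_t sharedMem,
                                        cudaStream_t stream) {
    using real_fn = cudaError_t (*)(const void *, dim3, dim3, void **,
                                    size_t, cudaStream_t);
    static real_fn real = (real_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // Conservative ordering: ensure all deferred copies have finished
    // before any kernel runs. HUM goes further and overlaps the copies
    // with the kernel itself via its subdivision technique.
    cudaStreamSynchronize(hum_stream());
    return real(func, grid, block, args, sharedMem, stream);
}

Because the wrappers shadow the CUDA runtime entry points, the application's source is untouched, which matches the paper's "without any code modification" claim.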
References

  1. Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). 143--152.
  2. Massimo Bernaschi, Mauro Bisson, and Davide Rossetti. 2013. Benchmarking of communication techniques for GPUs. J. Parallel and Distrib. Comput. 73, 2 (2013), 250--255.
  3. Tanya Brokhman, Pavel Lifshits, and Mark Silberstein. 2019. GAIA: An OS Page Cache for Heterogeneous Systems. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, Renton, WA, 661--674. https://www.usenix.org/conference/atc19/presentation/brokhman
  4. Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44--54.
  5. Jason Cong, Hui Huang, Chunyue Liu, and Yi Zou. 2011. A reuse-aware prefetching scheme for scratchpad memory. In 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC). 960--965.
  6. Anthony Danalis, Ki-Yong Kim, Lori Pollock, and Martin Swany. 2005. Transformations to Parallel Codes for Communication-Computation Overlap. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. 58.
  7. Anthony Danalis, Lori Pollock, Martin Swany, and John Cavazos. 2009. MPI-aware Compiler Optimizations for Improving Communication-computation Overlap. In Proceedings of the 23rd International Conference on Supercomputing (ICS '09). ACM, New York, NY, USA, 316--325.
  8. Lewis Fishgold, Anthony Danalis, Lori Pollock, and Martin Swany. 2006. An automated approach to improve communication-computation overlap in clusters. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS). 7 pp.
  9. Free Software Foundation. 2019. mprotect(2) - Linux manual page. http://man7.org/linux/man-pages/man2/mprotect.2.html
  10. Serban Georgescu and Hiroshi Okuda. 2010. Conjugate gradients on multiple GPUs. International Journal for Numerical Methods in Fluids 64, 10-12 (2010), 1254--1273.
  11. Tobias Gysi, Jeremia Bär, and Torsten Hoefler. 2016. dCUDA: Hardware Supported Overlap of Computation and Communication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). IEEE Press, Piscataway, NJ, USA, Article 52, 12 pages. http://dl.acm.org/citation.cfm?id=3014904.3014974
  12. Mark Harris. 2012. How to Overlap Data Transfers in CUDA C/C++. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
  13. Mark Harris. 2017. Unified Memory for CUDA Beginners. https://devblogs.nvidia.com/unified-memory-cuda-beginners/
  14. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770--778.
  15. Ilya Issenin, Erik Brockmeyer, Miguel Miranda, and Nikil Dutt. 2004. Data Reuse Analysis Technique for Software-Controlled Memory Hierarchies. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1 (DATE '04). IEEE Computer Society, Washington, DC, USA. http://dl.acm.org/citation.cfm?id=968878.968995
  16. Ilya Issenin, Erik Brockmeyer, Miguel Miranda, and Nikil Dutt. 2007. DRDU: A Data Reuse Analysis Technique for Efficient Scratch-pad Memory Management. ACM Trans. Des. Autom. Electron. Syst. 12, 2, Article 15 (April 2007).
  17. Ali Khajeh-Saeed and J. Blair Perot. 2012. Computational Fluid Dynamics Simulations Using Many Graphics Processors. Computing in Science & Engineering 14, 3 (May 2012), 10--19.
  18. Ki-Hwan Kim and Q-Han Park. 2012. Overlapping computation and communication of three-dimensional FDTD on a GPU cluster. Computer Physics Communications 183, 11 (2012), 2364--2369.
  19. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS '12). Curran Associates Inc., USA, 1097--1105. http://dl.acm.org/citation.cfm?id=2999134.2999257
  20. Raphael Landaverde, Tiansheng Zhang, Ayse K. Coskun, and Martin Herbordt. 2014. An investigation of Unified Memory Access performance in CUDA. In 2014 IEEE High Performance Extreme Computing Conference (HPEC). 1--6.
  21. Wenqiang Li, Guanghao Jin, Xuewen Cui, and Simon See. 2015. An Evaluation of Unified Memory Technology on NVIDIA GPUs. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 1092--1098.
  22. Pak Markthub, Mehmet E. Belviranli, Seyong Lee, Jeffrey S. Vetter, and Satoshi Matsuoka. 2018. DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 414--426.
  23. NVIDIA. 2014. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110/210. Whitepaper. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
  24. NVIDIA. 2019. Artificial Intelligence Architecture | NVIDIA Volta. https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/
  25. NVIDIA. 2019. CUDA Code Samples. https://developer.nvidia.com/cuda-code-samples
  26. NVIDIA. 2019. CUDA Parallel Computing Platform. https://developer.nvidia.com/cuda-zone
  27. NVIDIA. 2019. CUDA Runtime API: Memory Management. https://docs.nvidia.com/cuda/cuda-runtime-api/group_CUDART_MEMORY.html
  28. NVIDIA. 2019. NVIDIA Driver Downloads. https://www.nvidia.com/Download/index.aspx
  29. NVIDIA. 2019. Pascal GPU Architecture. https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/
  30. NVIDIA. 2019. Professional Graphics Solution and Turing GPU Architecture | NVIDIA. https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/
  31. NVIDIA. 2019. Unified Memory Programming. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
  32. Everett H. Phillips and Massimiliano Fatica. 2010. Implementing the Himeno benchmark with CUDA on GPU clusters. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). 1--10.
  33. James C. Phillips, John E. Stone, and Klaus Schulten. 2008. Adapting a message-driven parallel application to GPU-accelerated clusters. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. 1--9.
  34. Berkeley AI Research. 2019. Caffe: Deep learning framework. http://caffe.berkeleyvision.org/
  35. Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (September 2014).
  36. John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Liwen Chang, Geng Liu, and Wen-Mei W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report IMPACT-12-01. University of Illinois at Urbana-Champaign, Urbana.
  37. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1--9.
  38. J. B. White III and J. J. Dongarra. 2011. Overlapping Computation and Communication for Advection on Hybrid Parallel Computers. In 2011 IEEE International Parallel & Distributed Processing Symposium. 59--67.
  39. Doran Wilde and Sanjay Rajopadhye. 1996. Memory reuse analysis in the polyhedral model. In Euro-Par '96 Parallel Processing, Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 389--397.

Published in

        PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
        February 2020
        454 pages
ISBN: 9781450368186
DOI: 10.1145/3332466

        Copyright © 2020 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • research-article

        Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions, 23%. Overall acceptance rate: 230 of 1,014 submissions, 23%.
