ABSTRACT
In this paper, we propose a runtime called HUM that hides host-to-device memory copy time without any code modification. It overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanisms. HUM provides wrapper functions for CUDA commands and executes host-to-device memory copy commands asynchronously. We also propose two runtime techniques. The first checks whether it is correct to make a synchronous host-to-device memory copy command asynchronous; if not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The second subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules the requests from different commands in a round-robin manner. As a result, kernel execution can be scheduled as early as possible to maximize the overlap. We evaluate HUM using 51 applications from Parboil, Rodinia, and the CUDA Code Samples, and compare their performance under HUM with that of hand-optimized implementations. The evaluation shows that executing the applications under HUM is, on average, 1.21 times faster than executing them under the original CUDA runtime. This speedup is comparable to the average speedup of 1.22 achieved by the hand-optimized implementations for Unified Memory.
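The second technique described above can be sketched in plain C++. The sketch below subdivides each host-to-device copy command into fixed-size chunks and interleaves chunks from different commands round-robin, so that a kernel waiting only on the first command's data can be released as early as possible. The names (`CopyChunk`, `ScheduleRoundRobin`) and the 4 MiB chunk granularity are illustrative assumptions, not HUM's actual API or parameters.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// One piece of a subdivided host-to-device memory copy command.
struct CopyChunk {
    int command_id;     // which original memcpy command this chunk belongs to
    std::size_t offset; // byte offset within that command's buffer
    std::size_t bytes;  // chunk size (the last chunk may be smaller)
};

// Assumed subdivision granularity; HUM's real value may differ.
constexpr std::size_t kChunkBytes = 4 * 1024 * 1024;

// Split each command's total size into chunks, then emit the chunks
// round-robin across the pending commands.
std::vector<CopyChunk>
ScheduleRoundRobin(const std::vector<std::size_t>& cmd_sizes) {
    std::vector<std::queue<CopyChunk>> per_cmd(cmd_sizes.size());
    for (std::size_t i = 0; i < cmd_sizes.size(); ++i) {
        for (std::size_t off = 0; off < cmd_sizes[i]; off += kChunkBytes) {
            per_cmd[i].push({static_cast<int>(i), off,
                             std::min(kChunkBytes, cmd_sizes[i] - off)});
        }
    }
    std::vector<CopyChunk> order;
    bool any = true;
    while (any) {
        any = false;
        for (auto& q : per_cmd) {       // one chunk from each command per pass
            if (!q.empty()) {
                order.push_back(q.front());
                q.pop();
                any = true;
            }
        }
    }
    return order;
}
```

With two pending commands of 8 MiB and 12 MiB, the schedule interleaves their chunks (cmd 0, cmd 1, cmd 0, cmd 1, cmd 1) instead of draining the first command completely, which is what lets a kernel depending on either buffer start sooner.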