
Transparent partial page migration between CPU and GPU

Research Article, Frontiers of Computer Science

Abstract

Despite increasing investment in integrated GPUs and next-generation interconnects, discrete GPUs connected by PCIe still dominate the market, and the management of data communication between CPU and GPU continues to evolve. Initially, programmers explicitly controlled data transfers between CPU and GPU. To simplify programming and enable system-wide atomic memory operations, GPU vendors have developed a programming model that provides a single virtual address space for accessing all CPU and GPU memories in the system. In this model, a page migration engine automatically migrates pages between CPU and GPU on demand. To meet the needs of high-performance workloads, page sizes tend to grow larger. Because the interconnect offers low bandwidth and high latency compared to GDDR, migrating a larger page takes longer, which may reduce the overlap of computation and transfer, waste time migrating unrequested data, block subsequent requests, and cause serious performance degradation. In this paper, we propose partial page migration, which migrates only the requested part of a page, thereby shrinking the migration unit, shortening migration latency, and avoiding the performance degradation that full page migration suffers as pages grow. We show that partial page migration can largely hide the performance overheads of full page migration. Compared with programmer-controlled data transfer, when the page size is 2 MB and the PCIe bandwidth is 16 GB/s, full page migration is 72.72× slower, while our partial page migration achieves a 1.29× speedup. When the PCIe bandwidth rises to 96 GB/s, full page migration is 18.85× slower, while our partial page migration provides a 1.37× speedup. Additionally, we examine the impact of PCIe bandwidth and migration unit size on execution time, enabling designers to make informed decisions.
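To make the two models in the abstract concrete, the CUDA sketch below contrasts programmer-controlled transfers with unified memory, whose page migration engine moves pages on demand. This is an illustrative sketch using standard CUDA runtime calls (cudaMemcpy, cudaMallocManaged); the kernel, array size, and launch geometry are our own assumptions, not code from the paper.

```cpp
// Contrast of explicit transfers vs. unified memory (CUDA runtime API).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;

    // (a) Programmer-controlled transfer: explicit staging copies over PCIe.
    float* h = new float[n]();
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    delete[] h;

    // (b) Unified memory: one pointer is valid on both CPU and GPU; the
    // page migration engine moves pages on demand when either side faults.
    float* u = nullptr;
    cudaMallocManaged(&u, n * sizeof(float));
    for (int i = 0; i < n; ++i) u[i] = 1.0f;  // first touch places pages on the CPU
    scale<<<(n + 255) / 256, 256>>>(u, n);    // GPU page faults pull pages in
    cudaDeviceSynchronize();
    printf("u[0] = %f\n", u[0]);              // a CPU touch migrates pages back
    cudaFree(u);
    return 0;
}
```

The cost argument behind partial page migration can also be sketched in a few lines. The host-side C++ below models a fault on a cold 2 MB page under a simple latency-plus-bandwidth PCIe model: full page migration moves the entire 2 MB, while partial page migration moves only the faulting sub-block. The 4 KB migration unit, the 1 µs fixed latency, and every identifier here are assumptions for illustration, not the paper's simulator parameters.

```cpp
// Toy cost model: full vs. partial page migration on a GPU page fault.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

constexpr uint64_t PAGE_SIZE  = 2ull << 20;   // 2 MB large page (as in the abstract)
constexpr uint64_t BLOCK_SIZE = 4ull << 10;   // 4 KB migration unit (assumed)
constexpr double   PCIE_BW    = 16e9;         // 16 GB/s (abstract's low-end case)
constexpr double   PCIE_LAT   = 1e-6;         // fixed per-transfer latency (assumed)

// Per-page bitmap recording which sub-blocks are resident in GPU memory.
struct PageState {
    std::vector<bool> resident = std::vector<bool>(PAGE_SIZE / BLOCK_SIZE, false);
};

std::unordered_map<uint64_t, PageState> page_table;  // page number -> residency

double transfer_time(uint64_t bytes) {
    return PCIE_LAT + bytes / PCIE_BW;
}

// Full page migration: a fault on any address stalls until the whole page arrives.
double fault_full(uint64_t addr) {
    PageState& st = page_table[addr / PAGE_SIZE];
    if (st.resident[0]) return 0.0;           // block 0 stands in for the whole page
    std::fill(st.resident.begin(), st.resident.end(), true);
    return transfer_time(PAGE_SIZE);
}

// Partial page migration: only the faulting block is moved, so the stall is
// bounded by the block size rather than the page size.
double fault_partial(uint64_t addr) {
    PageState& st = page_table[addr / PAGE_SIZE];
    uint64_t block = (addr % PAGE_SIZE) / BLOCK_SIZE;
    if (st.resident[block]) return 0.0;
    st.resident[block] = true;
    return transfer_time(BLOCK_SIZE);
}

int main() {
    page_table.clear();
    printf("full page stall   : %8.3f us\n", fault_full(0x12345678) * 1e6);
    page_table.clear();
    printf("partial page stall: %8.3f us\n", fault_partial(0x12345678) * 1e6);
    return 0;
}
```

Under these assumed parameters a cold fault stalls for roughly 132 µs with full page migration but about 1.3 µs with a 4 KB partial migration, which is why shrinking the migration unit shortens the critical-path latency and restores overlap between computation and transfer.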



Acknowledgements

We thank the anonymous reviewers for their valuable feedback. This work was supported by NSFC (Grant No. 61472431).

Author information

Corresponding author

Correspondence to Shiqing Zhang.

Additional information

Shiqing Zhang received the BS degree in computer science from the National University of Defense Technology (NUDT), China in 2016, where she is currently pursuing the MS degree. Her research interests include parallel programming and optimization techniques.

Zheng Qin received the BS degree in computer science from the National University of Defense Technology (NUDT), China in 2016, where he is currently pursuing the MS degree. His research interests include machine learning, computer vision, and deep learning acceleration.

Yaohua Yang received the BS degree in software engineering from Shandong University, China. Currently, he is a graduate student at the National University of Defense Technology, China. His research interests include high performance processors and optimization techniques.

Li Shen received the BS, MS, and PhD degrees in computer science from the National University of Defense Technology (NUDT), China. Currently he is a professor at NUDT, China. His research interests include high performance processor architecture, parallel programming, and optimization techniques.

Zhiying Wang received the BS, MS, and PhD degrees in computer science from the National University of Defense Technology (NUDT), China. Currently, he is a professor at NUDT, China. His research interests include processor architecture, on-chip interconnect, and information security.



About this article


Cite this article

Zhang, S., Qin, Z., Yang, Y. et al. Transparent partial page migration between CPU and GPU. Front. Comput. Sci. 14, 143101 (2020). https://doi.org/10.1007/s11704-018-7386-4

