Compiler Optimization of Accelerator Data Transfers

Ashcraft, Matthew B.; Lemon, Alexander; Penry, David A.; Snell, Quinn

doi:10.1007/s10766-017-0549-3

Published: 30 December 2017

Volume 47, pages 39–58, (2019)

International Journal of Parallel Programming

Matthew B. Ashcraft¹,
Alexander Lemon¹,
David A. Penry² &
Quinn Snell¹

Abstract

Accelerators such as GPUs, FPGAs, and many-core processors can provide significant performance improvements, but their effectiveness depends on the programmer's skill in managing their complex architectures. One area of difficulty is determining which data to transfer to and from the accelerator, and when. Poorly placed data transfers can incur overheads that completely dwarf the benefits of using an accelerator. To know what data to transfer, and when, the programmer must understand the data flow of the transferred memory locations throughout the program and how the accelerator region fits into the program as a whole. We argue that compilers should take on the responsibility of data transfer scheduling, thereby reducing the demands on the programmer and improving program performance and efficiency by reducing the number of bytes transferred. We show that whole-program scheduling of accelerator data transfers can automatically eliminate up to 99% of the bytes transferred to and from the accelerator, compared with transferring all of the data involved immediately before and after kernel execution. The analysis and optimization are language- and accelerator-agnostic, but for our examples and testing they have been implemented in an OpenMP to LLVM-IR to CUDA workflow.
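
To make the baseline in the abstract concrete, the following is a minimal illustrative sketch, not the authors' analysis. It only counts bytes under two hand-written policies for a kernel launched repeatedly in a loop: copying the array in and out around every launch, versus copying it in once before the loop and out once after it. The function names, the array size, and the iteration count are all hypothetical, and the second policy assumes the host never touches the array between launches, which is the kind of fact a whole-program data-flow analysis would have to establish.

```python
def naive_bytes(array_bytes, iterations):
    """Bytes moved when the array is copied around every kernel launch."""
    total = 0
    for _ in range(iterations):
        total += array_bytes  # host -> device immediately before the kernel
        total += array_bytes  # device -> host immediately after the kernel
    return total


def hoisted_bytes(array_bytes, iterations):
    """Bytes moved when transfers are hoisted out of the loop.

    Assumption (hypothetical): the host neither reads nor writes the array
    between launches, so one copy in and one copy out are sufficient.
    """
    if iterations == 0:
        return 0
    return 2 * array_bytes


def reduction(array_bytes, iterations):
    """Fraction of transferred bytes eliminated by hoisting."""
    naive = naive_bytes(array_bytes, iterations)
    if naive == 0:
        return 0.0
    return 1 - hoisted_bytes(array_bytes, iterations) / naive


if __name__ == "__main__":
    size = 4 * 1024 * 1024  # made-up array size in bytes
    launches = 50           # made-up number of kernel launches
    print(naive_bytes(size, launches), hoisted_bytes(size, launches))
    print(f"{reduction(size, launches):.0%} of bytes eliminated")
```

In this toy model the saving grows with the number of launches that share resident data; the figures are invented for illustration and say nothing about the results reported in the article.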


Acknowledgements

This work was supported by National Science Foundation Grant CNS-1054075.

Author information

Authors and Affiliations

Brigham Young University, Provo, UT, USA
Matthew B. Ashcraft, Alexander Lemon & Quinn Snell
ARM Ltd, Chandler, AZ, USA
David A. Penry

Corresponding author

Correspondence to Matthew B. Ashcraft.

About this article

Cite this article

Ashcraft, M.B., Lemon, A., Penry, D.A. et al. Compiler Optimization of Accelerator Data Transfers. Int J Parallel Prog 47, 39–58 (2019). https://doi.org/10.1007/s10766-017-0549-3

Received: 31 May 2017
Accepted: 17 December 2017
Published: 30 December 2017
Issue Date: 15 February 2019
DOI: https://doi.org/10.1007/s10766-017-0549-3
