Pragma Directed Shared Memory Centric Optimizations on GPUs

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

GPUs have become a ubiquitous choice as coprocessors owing to their excellent concurrent-processing capability. In GPU architectures, shared memory plays a very important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications with regular access patterns, optimizing for shared memory is not easy: it often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage can even underutilize GPU resources. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), shared memory remains hard to exploit, since these models lack inherent support for describing shared memory optimizations and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data management hints to the compiler. Meanwhile, we devise a compiler framework that automatically selects optimal parameters for shared arrays using the polyhedral model. We further propose optimization techniques to expose higher memory-level and instruction-level parallelism. Experimental results show that our shared-memory-centric approach improves the performance of five typical GPU applications on four widely used platforms by 3.7x on average, without burdening programmers with excessive pragmas.
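To make the idea concrete, the sketch below shows what a data-centric shared memory hint might look like for a simple 5-point stencil (an illustrative kernel, not necessarily one of the paper's five benchmarks). The "shared" clause is a hypothetical stand-in for the paper's pragma extension, not its actual syntax; the closest existing analogue in standard OpenACC is the "cache" directive. In the proposed framework, tile shapes and other shared-array parameters would be chosen by the compiler via the polyhedral model rather than by the programmer.

    /* Hypothetical sketch: a 5-point stencil in standard OpenACC C,
       annotated with an illustrative shared memory hint.
       "#pragma acc shared(a)" is NOT standard OpenACC; it stands in
       for the kind of data management hint the paper proposes. */
    void stencil(int n, float a[n][n], float b[n][n])
    {
        #pragma acc data copyin(a[0:n][0:n]) copyout(b[0:n][0:n])
        {
            #pragma acc kernels loop independent
            for (int i = 1; i < n - 1; i++) {
                #pragma acc loop independent
                for (int j = 1; j < n - 1; j++) {
                    /* Hint: stage the reused neighborhood of "a" in on-chip
                       shared memory; the compiler picks the tile size.
                       Standard OpenACC would express a similar intent as:
                       #pragma acc cache(a[i-1:3][j-1:3]) */
                    #pragma acc shared(a)
                    b[i][j] = 0.2f * (a[i][j] + a[i-1][j] + a[i+1][j]
                                      + a[i][j-1] + a[i][j+1]);
                }
            }
        }
    }

One appeal of a pragma-based design is graceful degradation: a compiler that does not recognize the extension can simply ignore it, leaving a valid OpenACC (and sequential C) program.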


Author information

Correspondence to Jing Li.

Additional information

This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, the National Natural Science Foundation of China (NSFC) under Grant No. 61432018, and the Innovation Research Group of NSFC under Grant No. 61221062.


Cite this article

Li, J., Liu, L., Wu, Y. et al. Pragma Directed Shared Memory Centric Optimizations on GPUs. J. Comput. Sci. Technol. 31, 235–252 (2016). https://doi.org/10.1007/s11390-016-1624-8
