Pragma Directed Shared Memory Centric Optimizations on GPUs

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

GPUs have become a ubiquitous choice as coprocessors owing to their excellent concurrent-processing capability. In GPU architectures, shared memory plays a very important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications with regular access patterns, optimizing for shared memory is not easy: it often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage can even underutilize GPU resources. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), shared memory remains hard to exploit, since these models lack inherent support for describing shared memory optimizations and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data management hints to the compiler. Meanwhile, we devise a compiler framework that automatically selects optimal parameters for shared arrays using the polyhedral model. We further propose optimization techniques to expose higher memory-level and instruction-level parallelism. Experimental results show that our shared-memory-centric approach improves the performance of five typical GPU applications on four widely used platforms by 3.7x on average, without burdening programmers with excessive pragmas.
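To make the idea concrete, the sketch below shows what a data-centric shared memory hint might look like for a simple 5-point stencil (an illustrative kernel, not necessarily one of the paper's five benchmarks). The "shared" clause is a hypothetical stand-in for the paper's pragma extension, not its actual syntax; the closest existing analogue in standard OpenACC is the "cache" directive. In the proposed framework, tile shapes and other shared-array parameters would be chosen by the compiler via the polyhedral model rather than by the programmer.

    /* Hypothetical sketch: a 5-point stencil in standard OpenACC C,
       annotated with an illustrative shared memory hint.
       "#pragma acc shared(a)" is NOT standard OpenACC; it stands in
       for the kind of data management hint the paper proposes. */
    void stencil(int n, float a[n][n], float b[n][n])
    {
        #pragma acc data copyin(a[0:n][0:n]) copyout(b[0:n][0:n])
        {
            #pragma acc kernels loop independent
            for (int i = 1; i < n - 1; i++) {
                #pragma acc loop independent
                for (int j = 1; j < n - 1; j++) {
                    /* Hint: stage the reused neighborhood of "a" in on-chip
                       shared memory; the compiler picks the tile size.
                       Standard OpenACC would express a similar intent as:
                       #pragma acc cache(a[i-1:3][j-1:3]) */
                    #pragma acc shared(a)
                    b[i][j] = 0.2f * (a[i][j] + a[i-1][j] + a[i+1][j]
                                      + a[i][j-1] + a[i][j+1]);
                }
            }
        }
    }

One appeal of a pragma-based design is graceful degradation: a compiler that does not recognize the extension can simply ignore it, leaving a valid OpenACC (and sequential C) program.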


Author information

Correspondence to Jing Li.

Additional information

This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, the National Natural Science Foundation of China (NSFC) under Grant No. 61432018, and the Innovation Research Group of NSFC under Grant No. 61221062.


Cite this article

Li, J., Liu, L., Wu, Y. et al. Pragma Directed Shared Memory Centric Optimizations on GPUs. J. Comput. Sci. Technol. 31, 235–252 (2016). https://doi.org/10.1007/s11390-016-1624-8
