
Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Computation reuse is known to be an effective optimization technique. However, due to the complexity of modern GPU architectures, there is not yet enough understanding of how the interplay between computation reuse and hardware specifics affects application performance. In this paper, we propose an automatic code generator for a class of stencil codes with inherent computation reuse on GPUs. For such applications, the proper reuse of intermediate results, combined with careful use of registers and on-chip local memory, has profound implications for performance. The current state of the art does not address this problem in depth, partly because it lacks a good program representation that can expose all potential computation reuse. We leverage the computation overlap graph (COG), a simple representation of data dependence and data reuse with an “element view”, to expose potential reuse opportunities. Using COG, we propose a portable code generation and tuning framework for GPUs. Compared with current state-of-the-art code generators, our experimental results show up to 56.7% performance improvement on modern GPUs such as the NVIDIA C2050.
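
To make the idea of computation reuse in a stencil concrete, the following is a minimal sketch in Python. It is not the COG representation or the code generator from the paper; the 3x3 box-sum stencil and the function names `box3x3_naive` and `box3x3_reuse` are assumptions chosen only for illustration. The naive version recomputes every neighbourhood sum from scratch, while the reuse version stores an intermediate result (a vertical 3-point sum) once per element and shares it among the three neighbouring outputs that need it.

```python
def box3x3_naive(a):
    """Sum the 3x3 neighbourhood of every interior point directly."""
    n, m = len(a), len(a[0])
    out = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Nine loads and eight additions per output point.
            out[i][j] = sum(a[i + di][j + dj]
                            for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return out


def box3x3_reuse(a):
    """Same stencil, but share vertical partial sums between outputs."""
    n, m = len(a), len(a[0])
    # Intermediate result: one vertical 3-point sum per element,
    # computed once and then read by up to three outputs in the same row.
    col = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(m):
            col[i][j] = a[i - 1][j] + a[i][j] + a[i + 1][j]
    out = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # Two additions here plus two for the shared column sum.
            out[i][j] = col[i][j - 1] + col[i][j] + col[i][j + 1]
    return out


# Small integer grid (hypothetical data) so the comparison is exact.
grid = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
assert box3x3_naive(grid) == box3x3_reuse(grid)
```

On a GPU, the design question the abstract points to is where an intermediate such as `col` should live (registers, on-chip local memory, or recomputed per thread); this sketch only shows the arithmetic saving, not that placement decision.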



Author information

Corresponding author

Correspondence to Guo-Ping Long.

About this article

Cite this article

Ma, WJ., Gao, K. & Long, GP. Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs. J. Comput. Sci. Technol. 31, 1262–1274 (2016). https://doi.org/10.1007/s11390-016-1696-5

  • DOI: https://doi.org/10.1007/s11390-016-1696-5
