Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs

Ma, Wen-Jing; Gao, Kan; Long, Guo-Ping

doi:10.1007/s11390-016-1696-5

Regular Paper
Published: 09 November 2016

Volume 31, pages 1262–1274 (2016)
Journal of Computer Science and Technology

Wen-Jing Ma¹,², Kan Gao³ & Guo-Ping Long¹

Abstract

Computation reuse is a well-known and effective optimization technique. However, because modern GPU architectures are complex, it is not yet well understood how the interplay between computation reuse and hardware specifics affects application performance. In this paper, we propose an automatic code generator for a class of stencil codes with inherent computation reuse on GPUs. For such applications, the proper reuse of intermediate results, combined with careful use of registers and on-chip local memory, has a profound effect on performance. The current state of the art does not address this problem in depth, partly because it lacks a good program representation that can expose all potential computation reuse. We leverage the computation overlap graph (COG), a simple representation of data dependence and data reuse with an “element view”, to expose potential reuse opportunities. Using COG, we propose a portable code generation and tuning framework for GPUs. Compared with current state-of-the-art code generators, our experimental results show up to 56.7% performance improvement on modern GPUs such as NVIDIA C2050.
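To make "inherent computation reuse" concrete, the following is a minimal CPU-side sketch in Python, not the authors' code generator and not the COG representation. It assumes a hypothetical 3x3 box-sum stencil: neighboring output points share the same vertical 3-point sums, so computing those sums once as intermediate results and reusing them reduces the number of additions. The function names `box3x3_naive` and `box3x3_reuse` are illustrative only.

```python
def box3x3_naive(grid):
    """Sum the 3x3 neighborhood of every interior point, with no reuse."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    adds = 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = sum(
                grid[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
            )
            adds += 8  # nine operands need eight additions
    return out, adds


def box3x3_reuse(grid):
    """Same stencil, but each vertical 3-point sum is computed only once."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    adds = 0
    # Intermediate results: one vertical 3-point sum per column position.
    colsum = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(cols):
            colsum[i][j] = grid[i - 1][j] + grid[i][j] + grid[i + 1][j]
            adds += 2
    # Each output point reuses three neighboring intermediate sums.
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = colsum[i][j - 1] + colsum[i][j] + colsum[i][j + 1]
            adds += 2
    return out, adds


if __name__ == "__main__":
    # Hypothetical 5x6 input grid, chosen only for illustration.
    demo = [[i * 6 + j for j in range(6)] for i in range(5)]
    _, naive_adds = box3x3_naive(demo)
    _, reuse_adds = box3x3_reuse(demo)
    print("additions without reuse:", naive_adds)
    print("additions with reuse:", reuse_adds)
```

On a GPU, where such intermediate sums are kept (registers versus on-chip local memory) is the kind of trade-off the abstract refers to; this sketch only counts arithmetic and does not model it.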

Author information

Authors and Affiliations

Laboratory of Parallel Software and Computing Science, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Wen-Jing Ma & Guo-Ping Long
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Wen-Jing Ma
Information Center, China Association for Science and Technology, Beijing, 100863, China
Kan Gao

Corresponding author

Correspondence to Guo-Ping Long.

About this article

Cite this article

Ma, WJ., Gao, K. & Long, GP. Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs. J. Comput. Sci. Technol. 31, 1262–1274 (2016). https://doi.org/10.1007/s11390-016-1696-5

Received: 15 October 2015
Revised: 07 July 2016
Published: 09 November 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s11390-016-1696-5
