A new memory mapping mechanism for GPGPUs’ stencil computation

Mo, Tieqiang; Li, Renfa

doi:10.1007/s00607-014-0434-5

Published: 11 November 2014

Computing, Volume 97, pages 795–812 (2015)

Tieqiang Mo¹ &
Renfa Li¹

Abstract

When optimizing performance on a GPU, control flow divergence among the threads of a warp can become a performance bottleneck. In our hand-coded GPU stencil computation, the conventional mapping between global memory and shared memory introduces such divergence. To remove it, we devise a new mapping mechanism that models the coalesced memory accesses of GPU threads and the overhead of aligned ghost zones, which eliminates the conditional statements for the stencil computation points on XY-tile boundaries and improves performance. In addition, we load only one XY-tile into registers in each stencil computation iteration, and we apply common sub-expression elimination and software prefetching to reduce overheads. Finally, a detailed performance evaluation shows that, with these optimizations, global memory access traffic is close to the idealized lower bound: per computed XY-tile point, the traffic is roughly 6% and 4% above the idealized lower bound of 8 bytes per point (which does not account for ghost zone overheads) on the Tesla C2050 and Kepler K20X, respectively.
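The traffic figures in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the article: it assumes the 8-byte lower bound corresponds to one 4-byte single-precision read plus one 4-byte write per point, and the names `IDEAL_BYTES_PER_POINT`, `traffic_per_point`, and `reported_overhead` are hypothetical.

```python
# Assumption (not stated in the abstract): values are single precision, so the
# ideal traffic per point is one 4-byte read plus one 4-byte write = 8 bytes.
BYTES_PER_VALUE = 4
IDEAL_BYTES_PER_POINT = 2 * BYTES_PER_VALUE  # 8 bytes, the abstract's lower bound


def traffic_per_point(overhead_fraction):
    """Bytes moved per computed point when traffic exceeds the ideal bound
    by the given fraction (for example 0.06 for 6%)."""
    return IDEAL_BYTES_PER_POINT * (1.0 + overhead_fraction)


# Overheads reported in the abstract, relative to the 8-byte lower bound.
reported_overhead = {"Tesla C2050": 0.06, "Kepler K20X": 0.04}

for gpu, extra in reported_overhead.items():
    print(f"{gpu}: about {traffic_per_point(extra):.2f} bytes per point")
```

Under these assumptions, the reported overheads correspond to roughly 8.48 bytes per point on the Tesla C2050 and 8.32 bytes per point on the Kepler K20X.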

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was partially supported by the National Natural Science Foundation of China under grant number 61173036.

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, Changsha, 410082, China
Tieqiang Mo & Renfa Li

Corresponding author

Correspondence to Tieqiang Mo.

About this article

Cite this article

Mo, T., Li, R. A new memory mapping mechanism for GPGPUs’ stencil computation. Computing 97, 795–812 (2015). https://doi.org/10.1007/s00607-014-0434-5

Received: 02 June 2014
Accepted: 29 October 2014
Published: 11 November 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s00607-014-0434-5

Mathematics Subject Classification

65Y05
