Accelerating Single Iteration Performance of CUDA-Based 3D Reaction–Diffusion Simulations

Holmen, John K.; Foster, David L.

doi:10.1007/s10766-013-0251-z

Published: 26 May 2013

Volume 42, pages 343–363 (2014)

International Journal of Parallel Programming

An Erratum to this article was published on 18 February 2014

Abstract

The most commonly used approach for solving reaction–diffusion systems relies upon stencil computations. Although stencil computations feature low compute intensity, they place high demands on memory bandwidth. Fortunately, GPU computing allows the heavy reliance of stencil computations on neighboring data points to be exploited to reduce these memory bandwidth demands and thereby significantly increase simulation speeds. A review of previously published work shows that a wide variety of efforts have been made to optimize NVIDIA CUDA-based stencil computations. However, a critical aspect contributing to algorithm performance is commonly glossed over: the halo region loading technique used in conjunction with a given spatial blocking technique. This paper presents an in-depth examination of this aspect and its impact on single-iteration performance when using symmetric, nearest-neighbor 19-point stencils. This is accomplished by closely examining how the simulated space is partitioned into thread blocks and the balance between memory accesses, divergence, and computing threads. The resulting optimization strategy for accelerating 3-dimensional reaction–diffusion simulations offers a speedup of up to 2.45 times for single-precision floating-point numbers relative to the GPU-based speedups reported in the previously published work that this paper directly extends. Relative to our multithreaded CPU-based implementation, the strategy offers a speedup of up to 8.69 times for single-precision floating-point numbers.
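
To make the stencil shape and the halo cost mentioned above more concrete, here is a minimal CPU-side sketch in Python. It is not the authors' CUDA code. It assumes that the 19-point stencil consists of the center cell, its 6 face neighbors, and its 12 edge neighbors (the 8 corners of the 3x3x3 cube are left out), and it counts how many distinct cells a single block would have to read. The function names and the block shapes are hypothetical and chosen only for illustration.

```python
from itertools import product


def stencil_offsets_19():
    """Offsets of a symmetric, nearest-neighbor 19-point stencil.

    Assumption: the 19 points are the center, its 6 face neighbors, and its
    12 edge neighbors in the 3x3x3 cube; the 8 corner neighbors are excluded.
    """
    return [o for o in product((-1, 0, 1), repeat=3)
            if sum(abs(c) for c in o) <= 2]


def cells_read(bx, by, bz):
    """Count the distinct cells one bx*by*bz block reads (interior plus halo)."""
    offsets = stencil_offsets_19()
    touched = {
        (x + dx, y + dy, z + dz)
        for x, y, z in product(range(bx), range(by), range(bz))
        for dx, dy, dz in offsets
    }
    return len(touched)


if __name__ == "__main__":
    # Hypothetical block shapes, chosen only for illustration.
    for shape in [(4, 4, 4), (16, 16, 1), (32, 8, 1)]:
        updated = shape[0] * shape[1] * shape[2]
        read = cells_read(*shape)
        print(shape, "reads", read, "cells to update", updated,
              "-> ratio", round(read / updated, 2))
```

The ratio of cells read to cells updated gives a rough sense of why block shape and halo loading matter. It treats each block in isolation and ignores memory coalescing, thread divergence, and any reuse of data between successive blocks.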

References

Molnár Jr., F., Izsák, F., Mészáros, R., Lagzi, I.: Simulation of reaction–diffusion processes in three dimensions using CUDA. Chemom. Intell. Lab. Syst. 108(1), 76–85 (2011)
Giles, M.: Jacobi Iteration for a Laplace Discretisation on a 3D Structured Grid. http://people.maths.ox.ac.uk/gilesm/cuda/prac3/laplace3d.pdf
Phillips, E.H., Fatica, M.: Implementing the Himeno Benchmark with CUDA on GPU Clusters. In: Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2010), pp. 1–10, April 2010
Micikevicius, P.: 3D finite difference computation on GPUs using CUDA. In: Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU2), pp. 79–84, March 2009
Zhang, Y., Mueller, F.: Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In: Proceedings of the 10th IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2012), March/April 2012
Unat, D., Cai, X., Baden, S.B.: Mint: realizing CUDA performance in 3D stencil methods with annotated C. In: Proceedings of the International Conference on Supercomputing (ICS ’11), pp. 214–224, May/June 2011
Nguyen, A., Satish, N., Chhugani, J., Kim, C., Dubey, P.: 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10), pp. 1–13, November 2010
Yang, Y., Cui, H.-M., Feng, X.-B., Xue, J.-L.: A hybrid circular queue method for iterative stencil computations on GPUs. J. Comput. Sci. Technol. 27(1), 57–74 (2012)
Holewinski, J., Pouchet, L.-N., Sadayappan, P.: High-performance code generation for stencil computations on GPU architectures. In: Proceedings of the 26th ACM International Conference on Supercomputing (ICS ’12), pp. 311–320, June 2012
Meng, J., Skadron, K.: Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on GPUs. In: Proceedings of the 23rd International Conference on Supercomputing (ICS ’09), pp. 256–265, June 2009
Kirk, D.B., Hwu, W.-M.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, San Francisco (2010)
Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison Wesley, Reading (2010)
Farber, R.: CUDA Application Design and Development. Morgan Kaufmann, San Francisco (2011)
NVIDIA Corporation, GeForce 8800 GTX - Specifications. http://www.geforce.com/hardware/desktop-gpus/geforce-8800-gtx/specifications
NVIDIA Corporation, GeForce GTX 275 - Specifications. http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-275/specifications
EVGA, GeForce GTX 260 Core 216 - Product Specification Sheet. http://www.evga.com/products/pdf/896-P3-1265.pdf
NVIDIA Corporation, Tesla C1060 Computing Processor Board Specification. http://nvidia.com/docs/IO/43395/BD-04111-001_v06.pdf
NVIDIA Corporation, GeForce GTX 560 Ti - Specifications. http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-560ti/specifications
Molnár Jr., F., Izsák, F., Mészáros, R., Lagzi, I.: Simulation of Reaction-Diffusion Processes in Three Dimensions using CUDA. http://nimbus.elte.hu/~uda/RD/cuda.html (2009)
NVIDIA Corporation, CUDA C Best Practices Guide v4.1. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Best_Practices_Guide.pdf

Acknowledgments

This research was supported by an equipment donation from the NVIDIA Corporation as a part of the Academic Partnership Program.

Author information

Authors and Affiliations

Electrical and Computer Engineering Department, Kettering University, Flint, MI, USA
John K. Holmen & David L. Foster

Corresponding author

Correspondence to John K. Holmen.

About this article

Cite this article

Holmen, J.K., Foster, D.L. Accelerating Single Iteration Performance of CUDA-Based 3D Reaction–Diffusion Simulations. Int J Parallel Prog 42, 343–363 (2014). https://doi.org/10.1007/s10766-013-0251-z

Received: 10 January 2013
Accepted: 10 May 2013
Published: 26 May 2013
Issue Date: April 2014
DOI: https://doi.org/10.1007/s10766-013-0251-z
