A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study

Tabik, S.; Peemen, M.; Romero, L. F.

doi:10.1007/s11227-017-2184-6

Published: 11 November 2017

Volume 74, pages 1580–1608, (2018)

The Journal of Supercomputing

Abstract

This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. As a case study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. Doing so means exploring the large space of all possible fission/fusion combinations, with and without tiling, which makes the process non-trivial. This study provides insights that reduce the tuning space and the programming effort for iterative multiple 3d stencils. Our results show that any combination that fuses the bottleneck stencil can be excluded from the exploration when that stencil has a high halo update cost ($>25\%$; this percentage can be measured or estimated experimentally for each individual stencil) and a high number of register and shared memory accesses. The optimal fission/fusion combination is up to 1.65$\times$ faster than the fully decomposed stencil without tiling and 5.3$\times$ faster than the fully fused version on NVIDIA GPUs.
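The size of this search space, and the effect of the pruning rule, can be illustrated with a minimal sketch. It is not the authors' tool. The stage names, the halo cost values, and the restriction to fusing only adjacent stages are assumptions made for illustration, and the register and shared memory criterion from the abstract is not modeled.

```python
from itertools import product


def contiguous_groupings(stages):
    """Yield every split of an ordered pipeline into groups of adjacent stages.

    Each group stands for one fused kernel; a group of size 1 is a stage
    that has been fissioned out on its own.
    """
    n = len(stages)
    # One boolean per gap between neighbouring stages: True means "cut here".
    for cuts in product([False, True], repeat=n - 1):
        groups, current = [], [stages[0]]
        for stage, cut in zip(stages[1:], cuts):
            if cut:
                groups.append(tuple(current))
                current = [stage]
            else:
                current.append(stage)
        groups.append(tuple(current))
        yield tuple(groups)


def candidate_space(stages, bottleneck, halo_cost, threshold=0.25):
    """Return (grouping, tiled) pairs that survive the halo-cost pruning rule.

    Assumption: a grouping is discarded when the bottleneck stage is fused
    with any other stage and its halo update cost exceeds the threshold.
    """
    kept = []
    for groups in contiguous_groupings(stages):
        fused_bottleneck = any(bottleneck in g and len(g) > 1 for g in groups)
        if fused_bottleneck and halo_cost[bottleneck] > threshold:
            continue
        for tiled in (False, True):
            kept.append((groups, tiled))
    return kept


# Hypothetical four-stage pipeline; the halo costs are made-up fractions.
stages = ["S1", "S2", "S3", "S4"]
halo_cost = {"S1": 0.05, "S2": 0.40, "S3": 0.10, "S4": 0.08}

space = candidate_space(stages, bottleneck="S2", halo_cost=halo_cost)
full = 2 ** (len(stages) - 1) * 2  # all groupings, each with and without tiling
print(len(space), "of", full, "configurations remain")
```

With these made-up numbers, only the groupings that keep `S2` as its own kernel remain, so 4 of the 16 configurations would still need to be benchmarked.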

Notes

This percentage can be measured or estimated experimentally for each single stencil.

Acknowledgements

This work was partially supported by Junta de Andalusia under Projects TIC-8260 and P11-TIC-7176. Siham Tabik was supported by the Ramón y Cajal Programme (RYC-2015-18136).

Author information

Authors and Affiliations

Department of Computer Science and Artificial Intelligence, University of Granada, 18071, Granada, Spain
S. Tabik
Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
M. Peemen
Department of Computer Architecture, University of Malaga, Málaga, Spain
L. F. Romero

Corresponding author

Correspondence to S. Tabik.

About this article

Cite this article

Tabik, S., Peemen, M. & Romero, L.F. A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study. J Supercomput 74, 1580–1608 (2018). https://doi.org/10.1007/s11227-017-2184-6

Issue Date: April 2018
DOI: https://doi.org/10.1007/s11227-017-2184-6
