
A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study

The Journal of Supercomputing

Abstract

This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. As a case study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. Doing so means exploring the large space of all possible fission/fusion combinations, with and without tiling, which makes the process non-trivial. This study provides insights that reduce the tuning space and the programming effort for iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil can be excluded from the exploration when that stencil has a high halo update cost (\(>25\%\); this percentage can be measured or estimated experimentally for each single stencil) and a high number of register and shared memory accesses. The optimal fission/fusion combination is up to 1.65\(\times\) faster than the fully decomposed version without tiling and 5.3\(\times\) faster than the fully fused version on NVIDIA GPUs.
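To make the size of the search space and the effect of the pruning rule concrete, the following is a minimal sketch rather than the authors' tuning procedure. It assumes a linear pipeline in which only adjacent stages can be fused, so a pipeline of n stages has 2^(n-1) fission/fusion combinations, each tried with and without tiling. The stage names, the 30% halo cost, the `high_resource_use` flag, and the helper names `fusion_combinations` and `prune` are all hypothetical; the 25% threshold is the one quoted in the abstract.

```python
from itertools import product


def fusion_combinations(stages):
    """Yield every split of an ordered pipeline into contiguous fused groups.

    Assumption: only adjacent stages may be fused, so each of the n-1
    boundaries between stages is either a cut (fission) or not (fusion).
    """
    n = len(stages)
    for cuts in product([False, True], repeat=n - 1):
        groups, current = [], [stages[0]]
        for stage, cut in zip(stages[1:], cuts):
            if cut:
                groups.append(tuple(current))
                current = [stage]
            else:
                current.append(stage)
        groups.append(tuple(current))
        yield tuple(groups)


def prune(combos, bottleneck, halo_cost, high_resource_use, threshold=0.25):
    """Apply one reading of the abstract's rule.

    If the bottleneck stencil has a halo update cost above the threshold
    and heavy register/shared memory use, keep only the combinations in
    which it runs as its own kernel (is not fused with any other stage).
    """
    combos = list(combos)
    if halo_cost <= threshold or not high_resource_use:
        return combos
    return [c for c in combos if (bottleneck,) in c]


# Hypothetical four-stage pipeline with "S3" as the bottleneck stencil.
stages = ["S1", "S2", "S3", "S4"]
all_combos = list(fusion_combinations(stages))
kept = prune(all_combos, bottleneck="S3", halo_cost=0.30, high_resource_use=True)

# Each remaining combination is still evaluated with and without tiling.
search_space = [(combo, tiled) for combo in kept for tiled in (False, True)]

print(len(all_combos) * 2, "configurations before pruning")
print(len(search_space), "configurations after pruning")
```

In this toy setting the rule shrinks the space from 16 configurations to 4, because every combination that fuses S3 with a neighbour is discarded before any benchmarking takes place.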

Acknowledgements

This work was partially supported by Junta de Andalusia under Projects TIC-8260 and P11-TIC-7176. Siham Tabik was supported by the Ramón y Cajal Programme (RYC-2015-18136).

Author information

Corresponding author

Correspondence to S. Tabik.

Cite this article

Tabik, S., Peemen, M. & Romero, L.F. A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study. J Supercomput 74, 1580–1608 (2018). https://doi.org/10.1007/s11227-017-2184-6

  • DOI: https://doi.org/10.1007/s11227-017-2184-6
