Abstract
This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3D stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. This requires exploring the large space of all possible fission/fusion combinations, with and without tiling, which makes the process non-trivial. This study provides insights that reduce the optimization tuning space and the programming effort for iterative multiple 3D stencils. Our results demonstrate that every combination that fuses the bottleneck stencil should be excluded from the exploration when that stencil has a high halo-update cost (\(>25\%\); this percentage can be measured or estimated experimentally for each single stencil) and a high number of register and shared-memory accesses. The optimal fission/fusion combination is up to 1.65\(\times \) faster than the fully decomposed stencil without tiling and 5.3\(\times \) faster than the fully fused version on NVIDIA GPUs.
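To make the size of the tuning space and the effect of the exclusion rule concrete, the following is a minimal sketch and not the authors' implementation. It assumes that only adjacent stages can be fused, that tiling is a single on/off choice per combination, and that the rule applies to a bottleneck stage that has both a halo-update cost above 25% and heavy register and shared-memory use. The stage names `S1` to `S4`, their cost values, and all function and field names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

HALO_COST_THRESHOLD = 0.25  # the >25% figure quoted in the abstract


@dataclass(frozen=True)
class Stage:
    name: str                 # hypothetical stage label
    halo_cost: float          # fraction of stage time spent on halo updates
    heavy_on_chip: bool       # assumed flag: high register/shared-memory use
    bottleneck: bool = False  # assumed flag: the dominant stage of the pipeline


def contiguous_groupings(stages):
    """Yield every split of the pipeline into groups of adjacent stages.

    A group with one stage is a separate kernel (fission); a group with
    several stages is one fused kernel. n stages give 2**(n-1) groupings.
    """
    for cuts in product((False, True), repeat=len(stages) - 1):
        groups, current = [], [stages[0]]
        for stage, cut in zip(stages[1:], cuts):
            if cut:
                groups.append(tuple(current))
                current = [stage]
            else:
                current.append(stage)
        groups.append(tuple(current))
        yield tuple(groups)


def is_excluded(grouping):
    """True if the grouping fuses a costly bottleneck stage with another stage."""
    return any(
        len(group) > 1
        and any(
            s.bottleneck and s.heavy_on_chip and s.halo_cost > HALO_COST_THRESHOLD
            for s in group
        )
        for group in grouping
    )


def search_space(stages):
    """Return (all candidates, candidates left after applying the rule)."""
    full = [(g, tiled) for g in contiguous_groupings(stages) for tiled in (False, True)]
    kept = [(g, tiled) for g, tiled in full if not is_excluded(g)]
    return full, kept


# Hypothetical four-stage pipeline; the numbers are illustrative only.
pipeline = [
    Stage("S1", 0.10, False),
    Stage("S2", 0.40, True, bottleneck=True),
    Stage("S3", 0.05, False),
    Stage("S4", 0.15, False),
]
full, kept = search_space(pipeline)
print(len(full), len(kept))
```

With these illustrative values the sketch enumerates 16 candidates and keeps 4, namely the groupings in which `S2` runs as its own kernel, each with and without tiling. In practice the per-stage halo cost would be measured or estimated for each single stencil before pruning.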
Notes
This percentage can be measured or estimated experimentally for each single stencil.
Acknowledgements
This work was partially supported by Junta de Andalusia under Projects TIC-8260 and P11-TIC-7176. Siham Tabik was supported by the Ramón y Cajal Programme (RYC-2015-18136).
Cite this article
Tabik, S., Peemen, M. & Romero, L.F. A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study. J Supercomput 74, 1580–1608 (2018). https://doi.org/10.1007/s11227-017-2184-6
DOI: https://doi.org/10.1007/s11227-017-2184-6