Abstract
Image processing and computer vision applications are usually complex, combining large volumes of data with high computational loads. Coping with this requires both optimization techniques and high-performance hardware platforms. Since these applications present many opportunities for parallelism, heterogeneous parallel platforms (HPPs) are an interesting choice, offering a good balance between high computation capability and the flexibility to handle a large spectrum of application features. Applications such as image filtering and edge detection make extensive use of the finite difference method to solve partial differential equations, whose computational pattern is called stencil computation. Stencil computations are known to be memory-bound, so reducing high-latency memory accesses is the biggest challenge in reaching high performance. In this paper, we present our methodology as the basis of a performance tuning framework that optimizes the implementation of multiple stencil computations on HPPs. Results show that our approach outperforms SDK-based methodologies. Moreover, with the proposed approach, the developer can efficiently investigate the performance of stencil computations before implementing actual code on the target platforms.
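To make the stencil pattern concrete, the following minimal sketch applies a 5-point finite-difference Laplacian, a common edge-detection operator, to a small 2D grid. It is an illustration only and not the implementation evaluated in this paper: the function name `laplacian_5pt`, the zero-valued border, and the sample grid are assumptions introduced here.

```python
def laplacian_5pt(img):
    """Apply a 5-point Laplacian stencil to the interior of a 2D grid.

    Hypothetical helper for illustration; border cells are left at 0.
    """
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Five reads and five arithmetic operations per output cell:
            # this low compute-to-memory ratio is why stencils tend to be
            # memory-bound.
            out[i][j] = (img[i - 1][j] + img[i + 1][j]
                         + img[i][j - 1] + img[i][j + 1]
                         - 4.0 * img[i][j])
    return out


# A single bright pixel in the centre of a 3x3 grid (assumed sample data).
image = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
result = laplacian_5pt(image)
```

Every output cell depends only on a fixed neighbourhood of input cells, so all interior cells can be computed independently, which is the parallelism that HPPs exploit.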
Notes
A kernel is a function that runs on a GPU. One kernel is executed at a time and many threads execute each kernel.
Occupancy rate is defined as the ratio of the number of threads allocated on a streaming multiprocessor (SM) to the maximum number of threads that the SM allows.
PCIe: Peripheral Component Interconnect Express.
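The occupancy rate defined in the notes above can be written as a one-line computation. This is a sketch under stated assumptions: the function name `occupancy` and the thread counts in the example (1536 allocated threads against a limit of 2048 per SM) are hypothetical values chosen for illustration, not figures from this paper.

```python
def occupancy(allocated_threads, max_threads_per_sm):
    """Return the ratio of allocated threads to the per-SM thread limit."""
    if max_threads_per_sm <= 0:
        raise ValueError("max_threads_per_sm must be positive")
    if not 0 <= allocated_threads <= max_threads_per_sm:
        raise ValueError("allocated_threads must lie within the SM limit")
    return allocated_threads / max_threads_per_sm


# Hypothetical example: 1536 threads allocated on an SM that allows 2048.
rate = occupancy(1536, 2048)
```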
Cite this article
Cheikh, T.L.B., Aguiar, A., Tahar, S. et al. Tuning framework for stencil computation in heterogeneous parallel platforms. J Supercomput 72, 468–502 (2016). https://doi.org/10.1007/s11227-015-1575-9
DOI: https://doi.org/10.1007/s11227-015-1575-9