Tuning framework for stencil computation in heterogeneous parallel platforms

The Journal of Supercomputing

Abstract

Image processing and computer vision applications typically process large amounts of data and impose high computational loads. Coping with these demands requires optimization techniques and high-performance hardware platforms. Since these applications offer many opportunities for parallelism, heterogeneous parallel platforms (HPPs) are an attractive choice: they balance high computational capability with the flexibility to handle a wide spectrum of application features. Applications such as image filtering and edge detection make extensive use of the finite difference method to solve partial differential equations, a computational pattern known as stencil computation. Stencil computations are known to be memory-bound, so reducing high-latency memory accesses is the biggest challenge in reaching high performance. In this paper, we present our methodology as the basis of a performance tuning framework for optimizing the implementation of multiple stencil computations on HPPs. Results show that our approach outperforms SDK-based methodologies. Moreover, with the proposed approach, the developer can efficiently investigate the performance of stencil computations before implementing actual code on the target platforms.
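To make the stencil pattern concrete, the following is a minimal sketch of a 5-point finite-difference Laplacian, a stencil commonly used in edge detection. It is an illustration only, not the implementation from the paper: the function name `laplacian_5pt`, the zero-valued border, and the sample grid are assumptions chosen for brevity.

```python
def laplacian_5pt(image):
    """Apply a 5-point finite-difference Laplacian stencil to a 2D grid.

    Assumption for this sketch: border cells are left at 0.0 instead of
    being handled with padding or ghost zones.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each output cell depends only on itself and its four
            # nearest neighbors: this fixed access pattern is the stencil.
            out[i][j] = (image[i - 1][j] + image[i + 1][j]
                         + image[i][j - 1] + image[i][j + 1]
                         - 4.0 * image[i][j])
    return out


# Hypothetical 3x3 input with a single bright pixel in the center.
image = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
result = laplacian_5pt(image)
```

Each output value reads five input values but performs only a handful of arithmetic operations, which illustrates why such computations tend to be limited by memory access rather than by arithmetic.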

Notes

  1. A kernel is a function that runs on a GPU. One kernel is executed at a time and many threads execute each kernel.

  2. Occupancy rate is defined as the ratio of the number of allocated threads to the maximum number of threads allowed per streaming multiprocessor (SM).

  3. PCIe: Peripheral Component Interconnect Express.
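The occupancy rate defined in note 2 can be written as a one-line ratio. The sketch below is illustrative only: the function name `occupancy_rate` and the example figures (1024 allocated threads against a limit of 2048 threads per SM) are assumptions, not values taken from the paper.

```python
def occupancy_rate(allocated_threads, max_threads_per_sm):
    """Return allocated threads divided by the per-SM thread limit."""
    if max_threads_per_sm <= 0:
        raise ValueError("max_threads_per_sm must be positive")
    if not 0 <= allocated_threads <= max_threads_per_sm:
        raise ValueError("allocated_threads must lie within the SM limit")
    return allocated_threads / max_threads_per_sm


# Hypothetical figures: 1024 threads allocated on an SM that allows 2048.
example = occupancy_rate(1024, 2048)
```

The actual per-SM limit depends on the GPU architecture, so the second argument has to be looked up for the target device.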

Author information

Corresponding author

Correspondence to Gabriela Nicolescu.

About this article

Cite this article

Cheikh, T.L.B., Aguiar, A., Tahar, S. et al. Tuning framework for stencil computation in heterogeneous parallel platforms. J Supercomput 72, 468–502 (2016). https://doi.org/10.1007/s11227-015-1575-9

  • DOI: https://doi.org/10.1007/s11227-015-1575-9
