Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators

  • Special Issue Paper
Journal of Real-Time Image Processing

Abstract

In recent years, image processing has become a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution for efficiently executing highly parallel kernels. However, architectural constraints impose hard limits on main memory bandwidth and call for software techniques that optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, based mainly on graph analysis and image tiling, that accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework that implements these techniques behind a front-end compliant with the OpenVX standard. The framework is based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory, and that greatly reduces the use of off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to a massive reduction in execution time and bandwidth usage, even when the main memory bandwidth available to the accelerator is severely constrained.
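
To make the bandwidth argument concrete, the following is a rough sketch of an off-chip traffic model. It is not the framework described in the article. It assumes a linear chain of kernels that each use a square window of a given radius, one byte per pixel by default, and square tiles. All function and parameter names are hypothetical. Without tiling, every kernel reads its input from main memory and writes its output back. With tiling, intermediate results stay on chip, so only the input of the chain (each tile enlarged by a halo) and the final output cross the memory interface.

```python
def unfused_traffic(width, height, num_kernels, bytes_per_pixel=1):
    """Bytes moved when every kernel reads and writes a full image off chip."""
    return 2 * num_kernels * width * height * bytes_per_pixel


def fused_tiled_traffic(width, height, num_kernels, tile, radius,
                        bytes_per_pixel=1):
    """Bytes moved when the whole chain runs tile by tile in on-chip memory.

    Assumption: each output tile needs an input tile enlarged by a halo of
    num_kernels * radius pixels per side, clipped at the image border.
    """
    halo = num_kernels * radius
    read_pixels = 0
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            x_lo, x_hi = max(0, x0 - halo), min(width, x0 + tile + halo)
            y_lo, y_hi = max(0, y0 - halo), min(height, y0 + tile + halo)
            read_pixels += (x_hi - x_lo) * (y_hi - y_lo)
    written_pixels = width * height  # final output is written exactly once
    return (read_pixels + written_pixels) * bytes_per_pixel


if __name__ == "__main__":
    # Hypothetical example: 64x64 image, three 3x3 kernels, 32x32 tiles.
    print(unfused_traffic(64, 64, 3))
    print(fused_tiled_traffic(64, 64, 3, tile=32, radius=1))
```

In this model the halo overlap is the price of tiling: smaller tiles fit more easily in on-chip memory but re-read more border pixels, so tile size trades memory footprint against redundant traffic.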

Notes

  1. The OpenCL 2.0 standard also enables dynamic parallelism on the device side, but most programming environments do not support it yet.

Author information

Corresponding author

Correspondence to Giuseppe Tagliavini.

Additional information

This work has been supported by the EU-funded research projects P-SOCRATES (g.a. 611016) and MULTITHERMAN (g.a. 291125).

About this article

Cite this article

Tagliavini, G., Haugou, G., Marongiu, A. et al. Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators. J Real-Time Image Proc 15, 73–92 (2018). https://doi.org/10.1007/s11554-015-0544-0

  • DOI: https://doi.org/10.1007/s11554-015-0544-0
