Optimizing GPU Code for CPU Execution Using OpenCL and Vectorization: A Case Study on Image Coding

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2016)

Abstract

Although OpenCL aims to achieve portability at the code level, different hardware platforms require different approaches in order to extract the best performance from OpenCL-based code. In this work, we take an image encoder originally tuned for OpenCL on GPU (OpenCL-GPU) and optimize it for multi-CPU based platforms. We produce two OpenCL-based versions: (i) a regular one (OpenCL-CPU) and (ii) a CPU vector-based one (OpenCL-CPU-Vect). The vectorized version relies on OpenCL's built-in vectorization support, which is much simpler than coding directly with SIMD instructions such as SSE and AVX. Overall, the OpenCL-GPU version is the fastest when run on a high-end GPU, requiring around 580 s to encode the Lenna image, but its performance drops roughly 65 % when run unchanged on a multicore CPU machine. For the CPU-tuned versions, OpenCL-CPU encodes the Lenna image in 805 s, while the vectorization-based approach performs the same operation in 672 s. The results show that meaningful performance gains can be achieved by tailoring the OpenCL code to the CPU, and that using CPU vector instructions through OpenCL is both fairly simple and rewarding in terms of performance.
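To put the reported timings in relative terms, the short sketch below derives speedup and time-reduction figures from the three encoding times quoted in the abstract. Only the three times come from the text; the dictionary keys, function names, and variable names are illustrative assumptions, and the derived ratios are simple arithmetic rather than results stated by the authors.

```python
# Encoding times for the Lenna image, in seconds, as quoted in the abstract.
# The dictionary keys are labels chosen here for illustration.
times_s = {
    "OpenCL-GPU": 580.0,       # GPU-tuned code on a high-end GPU
    "OpenCL-CPU": 805.0,       # CPU-tuned, non-vectorized version
    "OpenCL-CPU-Vect": 672.0,  # CPU-tuned, vectorized version
}


def speedup(baseline_s: float, improved_s: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


def time_reduction_pct(baseline_s: float, improved_s: float) -> float:
    """Return the percentage of baseline run time saved by the improved run."""
    return 100.0 * (baseline_s - improved_s) / baseline_s


# Gain from vectorization on the CPU (OpenCL-CPU vs. OpenCL-CPU-Vect).
vect_speedup = speedup(times_s["OpenCL-CPU"], times_s["OpenCL-CPU-Vect"])
vect_saving = time_reduction_pct(times_s["OpenCL-CPU"], times_s["OpenCL-CPU-Vect"])

# Remaining gap between the vectorized CPU version and the GPU run.
gpu_gap = speedup(times_s["OpenCL-CPU-Vect"], times_s["OpenCL-GPU"])

print(f"Vectorization speedup on CPU: {vect_speedup:.2f}x")
print(f"Run time saved by vectorization: {vect_saving:.1f}%")
print(f"GPU run vs. vectorized CPU run: {gpu_gap:.2f}x faster")
```

Under these figures, vectorization makes the CPU version about 1.20 times faster (roughly 16.5 % less run time), and the GPU run remains about 1.16 times faster than the vectorized CPU version.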

S.M.M. de Faria—Financial support provided in the scope of R&D Unit 50008, financed by FCT/MEC through national funds and co-funded by FEDER - PT2020 partnership agreement.


Author information

Corresponding author

Correspondence to Sergio M. M. de Faria.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Pereira, P.M.M., Domingues, P., Rodrigues, N.M.M., Falcao, G., de Faria, S.M.M. (2016). Optimizing GPU Code for CPU Execution Using OpenCL and Vectorization: A Case Study on Image Coding. In: Carretero, J., Garcia-Blas, J., Ko, R., Mueller, P., Nakano, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10048. Springer, Cham. https://doi.org/10.1007/978-3-319-49583-5_42

  • DOI: https://doi.org/10.1007/978-3-319-49583-5_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49582-8

  • Online ISBN: 978-3-319-49583-5

  • eBook Packages: Computer Science (R0)
