Optimizing GPU Code for CPU Execution Using OpenCL and Vectorization: A Case Study on Image Coding

Pereira, Pedro M. M.; Domingues, Patricio; Rodrigues, Nuno M. M.; Falcao, Gabriel; de Faria, Sergio M. M.

doi:10.1007/978-3-319-49583-5_42

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10048)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Although OpenCL aims to achieve portability at the code level, different hardware platforms require different approaches to extract the best performance from OpenCL-based code. In this work, we take an image encoder originally tuned for OpenCL on GPU (OpenCL-GPU) and optimize it for multi-CPU platforms. We produce two OpenCL-based versions: (i) a regular one (OpenCL-CPU) and (ii) a CPU vector-based one (OpenCL-CPU-Vect). The CPU vectorization exploits OpenCL's vector support, which makes it much simpler than coding directly with SIMD instructions such as SSE and AVX. Overall, the OpenCL-GPU version is the fastest when run on a high-end GPU, requiring around 580 s to encode the Lenna image, but its performance drops roughly 65% when it is run unchanged on a multicore CPU machine. For the CPU-tuned versions, OpenCL-CPU encodes the Lenna image in 805 s, while the vectorization-based approach performs the same operation in 672 s. Results show that meaningful performance gains can be achieved by tailoring the OpenCL code to the CPU, and that using CPU vectorization instructions through OpenCL is both fairly simple and rewarding in performance.

S.M.M. de Faria—Financial support provided in the scope of R&D Unit 50008, financed by FCT/MEC through national funds and co-funded by FEDER - PT2020 partnership agreement.

Author information

Authors and Affiliations

School of Management and Technology, Polytechnic Institute of Leiria, Leiria, Portugal
Pedro M. M. Pereira, Patricio Domingues, Nuno M. M. Rodrigues & Sergio M. M. de Faria
Instituto de Telecomunicações, Lisbon, Portugal
Pedro M. M. Pereira, Patricio Domingues, Nuno M. M. Rodrigues, Gabriel Falcao & Sergio M. M. de Faria
Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal
Gabriel Falcao

Corresponding author

Correspondence to Sergio M. M. de Faria.

Editor information

Editors and Affiliations

University Carlos III of Madrid, Leganes, Spain
Jesus Carretero
Carlos III University of Madrid, Leganes, Madrid, Spain
Javier Garcia-Blas
The University of Waikato, Hamilton, New Zealand
Ryan K.L. Ko
IBM Zurich Research Laboratory, Rüschlikon, Switzerland
Peter Mueller
Hiroshima University, Higashi-Hiroshima, Japan
Koji Nakano

About this paper

Cite this paper

Pereira, P.M.M., Domingues, P., Rodrigues, N.M.M., Falcao, G., de Faria, S.M.M. (2016). Optimizing GPU Code for CPU Execution Using OpenCL and Vectorization: A Case Study on Image Coding. In: Carretero, J., Garcia-Blas, J., Ko, R., Mueller, P., Nakano, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10048. Springer, Cham. https://doi.org/10.1007/978-3-319-49583-5_42

DOI: https://doi.org/10.1007/978-3-319-49583-5_42
Published: 25 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49582-8
Online ISBN: 978-3-319-49583-5
eBook Packages: Computer Science, Computer Science (R0)
