Abstract
Although OpenCL aims to achieve portability at the code level, different hardware platforms require different approaches to extract the best performance from OpenCL-based code. In this work, we take an image encoder originally tuned for OpenCL on GPU (OpenCL-GPU) and optimize it for multi-CPU-based platforms. We produce two OpenCL-based versions: (i) a regular one (OpenCL-CPU) and (ii) a CPU vector-based one (OpenCL-CPU-Vect). The CPU vectorization relies on OpenCL's built-in vector support, which makes it much simpler than coding directly with SIMD instruction sets such as SSE and AVX. Overall, the OpenCL-GPU version is the fastest when run on a high-end GPU, requiring around 580 s to encode the Lenna image, but its performance drops roughly 65 % when run unchanged on a multicore CPU machine. Of the CPU-tuned versions, OpenCL-CPU encodes the Lenna image in 805 s, while the vectorization-based approach performs the same operation in 672 s. The results show that meaningful performance gains can be achieved by tailoring the OpenCL code to the CPU, and that using CPU vectorization instructions through OpenCL is both rather simple and rewarding in terms of performance.
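To put the reported timings side by side, the short sketch below derives the relative gains implied by the three figures quoted in the abstract (580 s, 805 s and 672 s). The function names are illustrative only and do not come from the paper, and the "roughly 65 %" drop is deliberately left out because the abstract does not give the underlying time.

```python
# Encoding times for the Lenna image, in seconds, as quoted in the abstract.
GPU_S = 580.0       # OpenCL-GPU on a high-end GPU
CPU_S = 805.0       # OpenCL-CPU (CPU-tuned, no explicit vectorization)
CPU_VECT_S = 672.0  # OpenCL-CPU-Vect (CPU-tuned, vectorized)


def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


def time_reduction_pct(baseline_s: float, improved_s: float) -> float:
    """Percentage of the baseline time saved by the improved run."""
    return 100.0 * (baseline_s - improved_s) / baseline_s


def extra_time_pct(reference_s: float, other_s: float) -> float:
    """Percentage by which the other run is longer than the reference."""
    return 100.0 * (other_s - reference_s) / reference_s


if __name__ == "__main__":
    # Vectorized versus non-vectorized CPU version.
    print(f"vectorization speedup: {speedup(CPU_S, CPU_VECT_S):.2f}x")        # 1.20x
    print(f"time saved: {time_reduction_pct(CPU_S, CPU_VECT_S):.1f} %")       # 16.5 %
    # CPU-tuned versions versus the GPU version on a high-end GPU.
    print(f"OpenCL-CPU vs GPU: +{extra_time_pct(GPU_S, CPU_S):.1f} %")        # +38.8 %
    print(f"OpenCL-CPU-Vect vs GPU: +{extra_time_pct(GPU_S, CPU_VECT_S):.1f} %")  # +15.9 %
```

Read this way, vectorization through OpenCL cuts the CPU encoding time by about one sixth, and the vectorized CPU version takes about 16 % longer than the GPU version running on the high-end GPU.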
S.M.M. de Faria—Financial support provided in the scope of R&D Unit 50008, financed by FCT/MEC through national funds and co-funded by FEDER - PT2020 partnership agreement.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Pereira, P.M.M., Domingues, P., Rodrigues, N.M.M., Falcao, G., de Faria, S.M.M. (2016). Optimizing GPU Code for CPU Execution Using OpenCL and Vectorization: A Case Study on Image Coding. In: Carretero, J., Garcia-Blas, J., Ko, R., Mueller, P., Nakano, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10048. Springer, Cham. https://doi.org/10.1007/978-3-319-49583-5_42
DOI: https://doi.org/10.1007/978-3-319-49583-5_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49582-8
Online ISBN: 978-3-319-49583-5
eBook Packages: Computer Science (R0)