Abstract
Mobile devices have seen their performance increased in latest years due to improvements on System on Chip technologies. These shared memory systems now integrate multicore CPUs and accelerators, and obtaining the optimal performance from such heterogeneous architectures requires making use of accelerators in an efficient way. Graphics Processing Units (GPUs) are accelerators that often outperform multicore CPUs in data-parallel workloads by orders of magnitude, so their use for image processing applications on mobile devices is very important. In this work we explore tiling code optimizations for GPU applications running on mobile devices. A dynamic adaptive tile size selection methodology is created, which allows finding at runtime close-to-optimal parameterizations independently of the underlying architecture. Results demonstrate the performance benefits of these optimizations over a set of stencil-based image processing benchmarks.










Similar content being viewed by others
Notes
Single Instruction Multiple Threads.
References
Acosta A, Almeida F (2015) Towards the optimal execution of renderscript applications in android devices. Simul Model Pract Theory 58:55–64. https://doi.org/10.1016/j.simpat.2015.05.006
Afonso S, Acosta A, Almeida F (2017) Automatic acceleration of stencil codes in android devices, pp. 81–95. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-65482-9_6
Almeida F, Andonov R, González D, Moreno LM, Poirriez V, Rodríguez C (2002) Optimal tiling for the RNA base pairing problem. In: SPAA, pp. 173–182. https://doi.org/10.1145/564870.564901
Andonov R, Rajopadhye S (1997) Optimal orthogonal tiling of 2-d iterations. J Parallel Distrib Comput 45(2):159–165. https://doi.org/10.1006/jpdc.1997.1371
ARM: Mali graphics and multimedia processors. https://developer.arm.com/products/graphics-and-multimedia/mali-gpus
Boratto M, Alonso P, Giménez D, Barreto M (2013) Oliveira K Auto-tuning methodology to represent landform attributes on multicore and multi-gpu systems. In: Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM ’13, pp. 125–132. ACM, New York, NY, USA. https://doi.org/10.1145/2442992.2443006
Boratto M, Alonso P, Giménez D, Lastovetsky A (2017) Automatic tuning to performance modelling of matrix polynomials on multicore and multi-gpu systems. J Supercomput 73(1):227–239. https://doi.org/10.1007/s11227-016-1694-y
Chu SL, Hsiao CC (2013) Methods for optimizing opencl applications on heterogeneous multicore architectures. Appl Math Inf Sci 7(6):2549
García LP, Cuenca J, Giménez D (2007) Including improvement of the execution time in a software architecture of libraries with self-optimisation. In: ICSOFT (SE), pp. 156–161. Citeseer
Holewinski J, Pouchet LN, Sadayappan P (2012) High-performance code generation for stencil computations on gpu architectures. In: Proceedings of the 26th ACM International Conference on Supercomputing, pp. 311–320. ACM
Imagination: A quick guide to writing OpenCL kernels for PowerVR Rogue GPUs. https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/. Accessed 9 Oct 2018
Magni A, Dubach C, O’Boyle MFP (2013) A large-scale cross-architecture evaluation of thread-coarsening. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pp. 11:1–11:11. ACM, New York, NY, USA. https://doi.org/10.1145/2503210.2503268
Qualcomm: Adreno GPU SDK. https://developer.qualcomm.com/software/adreno-gpu-sdk. Accessed 9 Oct 2018
Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S (2013) Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not. 48(6):519–530. https://doi.org/10.1145/2499370.2462176
Rocha RCO, Pereira AD, Ramos L, Góes LFW (2017) Toast: automatic tiling for iterative stencil computations on gpus. Concurr Comput Pract Exp 29(8):4053. https://doi.org/10.1002/cpe.4053
Shen J, Fang J, Sips H, Varbanescu AL (2013) Performance traps in opencl for cpus. In: 2013 21st Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 38–45. IEEE
StatCounter: Mobile operating system market share worldwide. http://gs.statcounter.com/os-market-share/mobile/worldwide/2017. Accessed 9 Oct 2018
Vivante: Vivante Vega GPGPU technology. http://www.vivantecorp.com/index.php/en/technology/gpgpu.html. Accessed 9 Oct 2018
Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimizations of software and the atlas project. Parallel Comput 27(1):3–35. https://doi.org/10.1016/S0167-8191(00)00087-9
Wolfe M (1989) More iteration space tiling. In: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, Supercomputing ’89, pp. 655–664. ACM, New York, NY, USA. https://doi.org/10.1145/76263.76337
Zhang Y, Sinclair M, Chien AA (2013) Improving performance portability in opencl programs. In: ISC, pp. 136–150. Springer
Author information
Authors and Affiliations
Corresponding author
Additional information
This work was supported by the Ministry of Science, Innovation and Universities through the project TIN2016-78919-R and the Grant Number FPU16/00942, by the Government of the Canary Islands through the project ProID2017010130, by the CAPAP-H network and by the cHiPSet COST Action.
Rights and permissions
About this article
Cite this article
Afonso, S., Acosta, A. & Almeida, F. High-performance code optimizations for mobile devices. J Supercomput 75, 1382–1395 (2019). https://doi.org/10.1007/s11227-018-2638-5
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-018-2638-5