High-performance code optimizations for mobile devices

The Journal of Supercomputing

Abstract

The performance of mobile devices has increased in recent years thanks to improvements in System on Chip (SoC) technologies. These shared-memory systems now integrate multicore CPUs and accelerators, and obtaining optimal performance from such heterogeneous architectures requires using the accelerators efficiently. Graphics Processing Units (GPUs) are accelerators that often outperform multicore CPUs by orders of magnitude on data-parallel workloads, which makes them well suited to image processing applications on mobile devices. In this work we explore tiling code optimizations for GPU applications running on mobile devices. We propose a dynamic, adaptive tile size selection methodology that finds close-to-optimal parameterizations at runtime, independently of the underlying architecture. Results on a set of stencil-based image processing benchmarks demonstrate the performance benefits of these optimizations.
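
The idea of selecting a tile size at runtime can be illustrated with a minimal sketch in Python. This is not the methodology of the paper, whose details are not given in the abstract: it runs on the CPU rather than a GPU, it uses a 3x3 box blur as a stand-in stencil, and it picks the tile size by simply timing each candidate once. The names `blur_tiled` and `select_tile_size`, the candidate list, and the image dimensions are all hypothetical and chosen for illustration only.

```python
import time


def blur_tiled(img, tile_h, tile_w):
    """3x3 box blur (a simple stencil) computed tile by tile; borders are clamped."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for ty in range(0, h, tile_h):                      # loop over tiles
        for tx in range(0, w, tile_w):
            for y in range(ty, min(ty + tile_h, h)):    # loop inside one tile
                for x in range(tx, min(tx + tile_w, w)):
                    acc = 0.0
                    for dy in (-1, 0, 1):
                        yy = min(max(y + dy, 0), h - 1)
                        for dx in (-1, 0, 1):
                            xx = min(max(x + dx, 0), w - 1)
                            acc += img[yy][xx]
                    out[y][x] = acc / 9.0
    return out


def select_tile_size(img, candidates, run=blur_tiled, timer=time.perf_counter):
    """Time each candidate (tile_h, tile_w) once and return the fastest one.

    Assumption: one timed run per candidate is enough for illustration;
    a real tuner would repeat measurements and prune the search space.
    """
    best, best_time = None, float("inf")
    for tile_h, tile_w in candidates:
        start = timer()
        run(img, tile_h, tile_w)
        elapsed = timer() - start
        if elapsed < best_time:
            best, best_time = (tile_h, tile_w), elapsed
    return best


# Hypothetical usage: a synthetic 48x64 image and four candidate tile shapes.
image = [[float((x * y) % 7) for x in range(64)] for y in range(48)]
candidates = [(4, 4), (8, 8), (16, 16), (48, 64)]
print("selected tile size:", select_tile_size(image, candidates))
```

Every tile size produces the same output image, so the selection affects only execution time. The selected tile varies between machines and runs, which is the motivation for choosing it at runtime instead of fixing it in the source.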

Notes

  1. Single Instruction Multiple Threads.

Author information

Corresponding author

Correspondence to Sergio Afonso.

Additional information

This work was supported by the Ministry of Science, Innovation and Universities through the project TIN2016-78919-R and the Grant Number FPU16/00942, by the Government of the Canary Islands through the project ProID2017010130, by the CAPAP-H network and by the cHiPSet COST Action.

Cite this article

Afonso, S., Acosta, A. & Almeida, F. High-performance code optimizations for mobile devices. J Supercomput 75, 1382–1395 (2019). https://doi.org/10.1007/s11227-018-2638-5

  • DOI: https://doi.org/10.1007/s11227-018-2638-5
