High-performance code optimizations for mobile devices

Afonso, Sergio; Acosta, Alejandro; Almeida, Francisco

doi:10.1007/s11227-018-2638-5

Published: 11 October 2018

Volume 75, pages 1382–1395, (2019)

The Journal of Supercomputing

Abstract

Mobile devices have seen their performance increase in recent years thanks to improvements in System on Chip technologies. These shared-memory systems now integrate multicore CPUs and accelerators, and obtaining optimal performance from such heterogeneous architectures requires using the accelerators efficiently. Graphics Processing Units (GPUs) are accelerators that often outperform multicore CPUs in data-parallel workloads by orders of magnitude, so their use for image processing applications on mobile devices is very important. In this work we explore tiling code optimizations for GPU applications running on mobile devices. A dynamic adaptive tile size selection methodology is presented, which finds close-to-optimal parameterizations at runtime, independently of the underlying architecture. Results demonstrate the performance benefits of these optimizations on a set of stencil-based image processing benchmarks.
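
The abstract does not spell out how the tile size is chosen, so the following is only a minimal sketch of the general idea of runtime tile size selection, not the authors' methodology. It assumes a plain empirical search: a 3x3 box-blur stencil is run once per candidate tile shape and the fastest shape is kept. The names `blur_tiled` and `select_tile_size` and the candidate list are hypothetical, and the code runs on the CPU in pure Python, where tiling has little effect. It illustrates the control flow only, not GPU behaviour.

```python
import time


def blur_tiled(src, width, height, tile_w, tile_h):
    """3x3 box blur over a row-major image, traversed tile by tile."""
    dst = [0.0] * (width * height)
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            # Process one tile; min() clips tiles that overhang the image.
            for y in range(ty, min(ty + tile_h, height)):
                for x in range(tx, min(tx + tile_w, width)):
                    total, count = 0.0, 0
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < height and 0 <= nx < width:
                                total += src[ny * width + nx]
                                count += 1
                    dst[y * width + x] = total / count
    return dst


def select_tile_size(kernel, candidates, timer=time.perf_counter):
    """Run `kernel` once per candidate tile shape and return the fastest.

    Hypothetical helper: a simple exhaustive search, assumed here only
    for illustration.
    """
    best, best_time = None, float("inf")
    for tile in candidates:
        start = timer()
        kernel(*tile)
        elapsed = timer() - start
        if elapsed < best_time:
            best, best_time = tile, elapsed
    return best, best_time


# Usage: a synthetic 64x48 image and four assumed candidate tile shapes.
width, height = 64, 48
image = [float((x * 7 + y * 13) % 256) for y in range(height) for x in range(width)]
candidates = [(8, 8), (16, 4), (32, 2), (64, 1)]

best_tile, best_time = select_tile_size(
    lambda tw, th: blur_tiled(image, width, height, tw, th), candidates
)
print("fastest tile shape in this run:", best_tile)
```

The tile shape changes only the traversal order, so every candidate produces the same output image. That property is what makes it safe to switch tile sizes between invocations at runtime. A real implementation would likely time several runs per candidate and amortise the search over successive frames rather than measuring each shape once.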

Notes

Single Instruction Multiple Threads.

Author information

Authors and Affiliations

Department of Computer Engineering and Systems, Escuela Superior de Ingeniería y Tecnología, Universidad de La Laguna, 38200, Santa Cruz de Tenerife, Spain
Sergio Afonso, Alejandro Acosta & Francisco Almeida

Corresponding author

Correspondence to Sergio Afonso.

Additional information

This work was supported by the Ministry of Science, Innovation and Universities through the project TIN2016-78919-R and the Grant Number FPU16/00942, by the Government of the Canary Islands through the project ProID2017010130, by the CAPAP-H network and by the cHiPSet COST Action.

About this article

Cite this article

Afonso, S., Acosta, A. & Almeida, F. High-performance code optimizations for mobile devices. J Supercomput 75, 1382–1395 (2019). https://doi.org/10.1007/s11227-018-2638-5

Issue Date: 01 March 2019
