Abstract
The use of GPUs for general-purpose computation has increased dramatically in recent years due to the rising demand for computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.
Acknowledgments
This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER Funds of the European Union (Projects TIN2010-16735 and TIN2013-42148-P), by the Galician Government under the Consolidation Program of Competitive Reference Groups (Reference GRC2013-055), and by the FPU Program of the Ministry of Education of Spain (Reference AP2008-01012). We want to acknowledge the staff of CAPS Entreprise for their support in carrying out this work, as well as Roberto R. Expósito for his help in configuring the pluton cluster for our experiments. Finally, we thank the anonymous reviewers for their suggestions, which helped improve the paper.
Andión, J.M., Arenaz, M., Bodin, F. et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. Int J Parallel Prog 44, 620–643 (2016). https://doi.org/10.1007/s10766-015-0362-9