Automatic CPU/GPU Generation of Multi-versioned OpenCL Kernels for C++ Scientific Applications

Sotomayor, Rafael; Sanchez, Luis Miguel; Garcia Blas, Javier; Fernandez, Javier; Garcia, J. Daniel

doi:10.1007/s10766-016-0425-6

Published: 13 May 2016

Volume 45, pages 262–282 (2017)
International Journal of Parallel Programming

Abstract

Parallelism has become one of the most widespread paradigms for improving performance. However, it forces software developers to adapt applications and coding mechanisms to exploit the available computing devices. Legacy source code needs to be rewritten to take advantage of multi-core and many-core computing devices. Writing parallel applications in a traditional way is hard, expensive, and time-consuming. Furthermore, there is often more than one possible transformation or optimization that can be applied to a single piece of legacy code. Therefore, many parallel versions of the same original sequential code need to be considered. In this paper, we describe an automatic parallel source code generation workflow (REWORK) for parallel heterogeneous platforms. REWORK automatically identifies promising kernels in legacy C++ source code and generates multiple specific versions of those kernels to improve C++ applications, selecting the most adequate version based on both static source code and target platform characteristics.

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 609666 (REPARA) and from the Spanish Ministry of Economics and Competitiveness under grant TIN2013-41350-P.

Author information

Authors and Affiliations

University Carlos III of Madrid, Av. de la Universidad, 30, 28911, Leganes, Madrid, Spain
Rafael Sotomayor, Luis Miguel Sanchez, Javier Garcia Blas, Javier Fernandez & J. Daniel Garcia

Corresponding author

Correspondence to Luis Miguel Sanchez.

About this article

Cite this article

Sotomayor, R., Sanchez, L.M., Garcia Blas, J. et al. Automatic CPU/GPU Generation of Multi-versioned OpenCL Kernels for C++ Scientific Applications. Int J Parallel Prog 45, 262–282 (2017). https://doi.org/10.1007/s10766-016-0425-6

Received: 03 September 2015
Accepted: 31 March 2016
Published: 13 May 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s10766-016-0425-6