ABSTRACT
The performance of OpenCL programs suffers from memory and control-flow divergence. OpenCL compilers therefore employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent and thus occurs for some, but not all, inputs. In these cases, vectorizing compilers must generate slow code because divergence can occur at run time. In this paper, we use a polyhedral abstraction to partition the input space of an OpenCL kernel. For each partition, divergence analysis produces more precise results, i.e., it can classify more code parts as non-divergent. Specializing the kernel for these input-space partitions consequently yields better SIMD code because less divergence remains. We implemented our technique in an OpenCL driver for the AVX instruction set and evaluated it on a range of OpenCL benchmarks. We observe speedups of up to 9x for irregular kernels over a state-of-the-art vectorizing OpenCL driver.
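To illustrate the idea of input-dependent divergence and kernel specialization, consider a minimal sketch (not taken from the paper; all names, the SIMD width, and the dispatch logic are hypothetical). A per-work-item bounds check such as `if (gid < n)` diverges only when `n` is not a multiple of the SIMD width; on the input-space partition where it is a multiple, the guard is uniform and a specialized, branch-free variant can be selected at launch time:

```c
#include <stddef.h>

enum { WIDTH = 8 };  /* assumed SIMD width, hypothetical */

/* Generic variant, scalarized for illustration: the per-lane guard
 * must stay because the trip count may not fill the last vector. */
static void kernel_generic(const int *in, int *out, size_t n) {
    size_t padded = ((n + WIDTH - 1) / WIDTH) * WIDTH;
    for (size_t gid = 0; gid < padded; ++gid)
        if (gid < n)                 /* potentially divergent branch */
            out[gid] = in[gid] * 2;
}

/* Specialized variant for the input-space partition n % WIDTH == 0:
 * the guard is provably uniform (always true), so it is dropped. */
static void kernel_specialized(const int *in, int *out, size_t n) {
    for (size_t gid = 0; gid < n; ++gid)   /* branch-free body */
        out[gid] = in[gid] * 2;
}

/* Driver-side dispatch: a cheap runtime test picks the specialized
 * kernel whenever the partition's precondition holds. */
static void launch(const int *in, int *out, size_t n) {
    if (n % WIDTH == 0)
        kernel_specialized(in, out, n);
    else
        kernel_generic(in, out, n);
}
```

Both variants compute the same result; the point is that on the partition `n % WIDTH == 0` the vectorizer sees no divergent control flow and can emit unmasked SIMD loads and stores.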
Index Terms: Input space splitting for OpenCL