Input space splitting for OpenCL

Research article. DOI: 10.1145/2892208.2892217
Published: 17 March 2016

Abstract

The performance of OpenCL programs suffers from memory and control flow divergence. OpenCL compilers therefore employ static analyses to identify non-divergent control flow and memory accesses and so produce faster code. However, divergence is often input-dependent: it occurs for some inputs but not for all. In these cases, vectorizing compilers have to generate slow code because divergence can occur at run time. In this paper, we use a polyhedral abstraction to partition the input space of an OpenCL kernel. For each partition, divergence analysis produces more precise results, i.e., it can classify more parts of the code as non-divergent. Consequently, specializing the kernel for each input space partition allows the compiler to generate better SIMD code because less divergence remains. We implemented our technique in an OpenCL driver for the AVX instruction set and evaluated it on a range of OpenCL benchmarks. We observe speedups of up to 9x for irregular kernels over a state-of-the-art vectorizing OpenCL driver.
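To make the idea of input-dependent divergence concrete, the following is a minimal sketch in Python. It is not the paper's implementation and uses no polyhedral library. It assumes a hypothetical kernel whose only branch is `gid < n`, a SIMD width of 8 lanes (as for single-precision floats in 256-bit AVX), and a global size that is a multiple of the SIMD width. The partitions of the input space `(n, global_size)` are derived by hand for this one condition; all function names are illustrative.

```python
SIMD_WIDTH = 8  # assumption: 8 float lanes, as in 256-bit AVX


def group_is_divergent(group_start, n, width=SIMD_WIDTH):
    """True if the lanes of one SIMD group disagree on the branch `gid < n`."""
    taken = [gid < n for gid in range(group_start, group_start + width)]
    return any(taken) and not all(taken)


def count_divergent_groups(n, global_size, width=SIMD_WIDTH):
    """Count SIMD groups in which the branch `gid < n` diverges."""
    return sum(
        group_is_divergent(start, n, width)
        for start in range(0, global_size, width)
    )


def classify_input(n, global_size, width=SIMD_WIDTH):
    """Hand-derived partition of the input space (n, global_size).

    Assumes global_size is a multiple of width. Only the "mixed"
    partition needs code that handles divergence.
    """
    if n >= global_size:
        return "all_taken"   # every work-item takes the branch
    if n <= 0:
        return "none_taken"  # no work-item takes the branch
    if n % width == 0:
        return "aligned"     # the boundary falls between two SIMD groups
    return "mixed"           # exactly one SIMD group diverges


if __name__ == "__main__":
    for n in (64, 0, 16, 13):
        print(n, classify_input(n, 64), count_divergent_groups(n, 64))
```

A run-time dispatcher could select a kernel variant specialized for the partition returned by `classify_input`, so that three of the four partitions run code with no divergence handling. Without such splitting, a single kernel version must conservatively assume the "mixed" case for every input.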

    Published In

    CC '16: Proceedings of the 25th International Conference on Compiler Construction
    March 2016
    270 pages
    ISBN: 9781450342414
    DOI: 10.1145/2892208
    • General Chair: Ayal Zaks
    • Program Chair: Manuel Hermenegildo
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    • IEEE-CS: Computer Society

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Divergence
    2. OpenCL
    3. Polyhedral Representation
    4. SPMD
    5. Vectorization

    Qualifiers

    • Research-article

    Conference

    CGO '16

    Cited By

    • (2023) High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 119-134. DOI: 10.1145/3572848.3577475. Online publication date: 25-Feb-2023
    • (2022) Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance. 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pp. 100-110. DOI: 10.1109/P3HPC56579.2022.00015. Online publication date: Nov-2022
    • (2022) Efficient execution of OpenMP on GPUs. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pp. 41-52. DOI: 10.1109/CGO53902.2022.9741290. Online publication date: 2-Apr-2022
    • (2022) Optimizing GPU deep learning operators with polyhedral scheduling constraint injection. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pp. 313-324. DOI: 10.1109/CGO53902.2022.9741260. Online publication date: 2-Apr-2022
    • (2020) Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation. Workshop Proceedings of the 49th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3409390.3409403. Online publication date: 17-Aug-2020
    • (2019) Compiler Optimizations for Parallel Programs. Languages and Compilers for Parallel Computing, pp. 112-119. DOI: 10.1007/978-3-030-34627-0_9. Online publication date: 13-Nov-2019
    • (2018) Polyhedral expression propagation. Proceedings of the 27th International Conference on Compiler Construction, pp. 25-36. DOI: 10.1145/3178372.3179529. Online publication date: 24-Feb-2018
    • (2018) Compiler Optimizations for OpenMP. Evolving OpenMP for Evolving Architectures, pp. 113-127. DOI: 10.1007/978-3-319-98521-3_8. Online publication date: 29-Aug-2018
    • (2017) Optimistic loop optimization. Proceedings of the 2017 International Symposium on Code Generation and Optimization, pp. 292-304. DOI: 10.5555/3049832.3049864. Online publication date: 4-Feb-2017
    • (2017) Optimistic loop optimization. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 292-304. DOI: 10.1109/CGO.2017.7863748. Online publication date: Feb-2017
