DOI: 10.1145/2892208.2892217
research-article

Input space splitting for OpenCL

Published: 17 March 2016

ABSTRACT

The performance of OpenCL programs suffers from memory and control flow divergence. Therefore, OpenCL compilers employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent and hence can be observed for some, but not all, inputs. In these cases, vectorizing compilers have to generate slow code because divergence can occur at run time. In this paper, we use a polyhedral abstraction to partition the input space of an OpenCL kernel. For each partition, divergence analysis produces more precise results, i.e., it can classify more code parts as non-divergent. Consequently, specializing the kernel for the input space partitions allows for generating better SIMD code because of less divergence. We implemented our technique in an OpenCL driver for the AVX instruction set and evaluated it on a range of OpenCL benchmarks. We observe speedups of up to 9x for irregular kernels over a state-of-the-art vectorizing OpenCL driver.
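To illustrate the idea of specializing a kernel per input space partition, the following is a minimal Python sketch, not the implementation described in the paper. It simulates one SIMD group of work-items for a hypothetical kernel `dst[gid] = 2 * src[gid * stride]` guarded by `gid < n`. The kernel, the names `kernel_generic`, `kernel_specialized`, and `select_variant`, and the hand-written partition condition (`stride == 1` and `n` divisible by the SIMD width) are all assumptions made for illustration; the paper derives its partitions from a polyhedral abstraction rather than from a manually chosen predicate.

```python
W = 8  # assumed SIMD width: 8 lanes of 32-bit values in a 256-bit AVX register


def kernel_generic(src, dst, n, stride, group_start):
    """Fallback variant: nothing is known about n or stride, so every lane
    needs its own bounds check and its own (possibly scattered) load."""
    for lane in range(W):
        gid = group_start + lane
        if gid < n:                           # may diverge within the group
            dst[gid] = 2 * src[gid * stride]  # may be a non-consecutive access


def kernel_specialized(src, dst, n, stride, group_start):
    """Variant valid only on the partition stride == 1 and n % W == 0:
    the guard is always true and the load is one contiguous block."""
    block = src[group_start:group_start + W]
    dst[group_start:group_start + W] = [2 * x for x in block]


def select_variant(n, stride):
    """Run-time dispatch on the kernel arguments (hypothetical partition test)."""
    if stride == 1 and n % W == 0:
        return kernel_specialized
    return kernel_generic


def run(src, n, stride):
    """Launch enough SIMD groups to cover n work-items."""
    dst = [0] * n
    variant = select_variant(n, stride)
    for group in range((n + W - 1) // W):
        variant(src, dst, n, stride, group * W)
    return dst
```

Both variants compute the same result on the inputs for which the specialized one is selected; the difference is that the specialized variant contains no per-lane branch and no strided access, which is the kind of code a vectorizer can only emit when divergence is ruled out for the whole partition.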


Published in

CC 2016: Proceedings of the 25th International Conference on Compiler Construction
March 2016
270 pages
ISBN: 9781450342414
DOI: 10.1145/2892208
General Chair: Ayal Zaks
Program Chair: Manuel Hermenegildo

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
