ABSTRACT
The performance of OpenCL programs suffers from memory and control-flow divergence. OpenCL compilers therefore employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent and thus occurs for some, but not all, inputs. In these cases, vectorizing compilers must generate slow code because divergence can occur at run time. In this paper, we use a polyhedral abstraction to partition the input space of an OpenCL kernel. For each partition, divergence analysis produces more precise results, i.e., it can classify more code parts as non-divergent. Specializing the kernel for these input-space partitions consequently yields better SIMD code because less divergence remains. We implemented our technique in an OpenCL driver for the AVX instruction set and evaluated it on a range of OpenCL benchmarks. We observe speedups of up to 9x for irregular kernels over a state-of-the-art vectorizing OpenCL driver.
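To illustrate the idea of input-dependent divergence and kernel specialization, consider a minimal sketch (not taken from the paper; all names, the SIMD width, and the dispatch logic are hypothetical). A per-work-item bounds check such as `if (gid < n)` diverges only when `n` is not a multiple of the SIMD width; on the input-space partition where it is a multiple, the guard is uniform and a specialized, branch-free variant can be selected at launch time:

```c
#include <stddef.h>

enum { WIDTH = 8 };  /* assumed SIMD width, hypothetical */

/* Generic variant, scalarized for illustration: the per-lane guard
 * must stay because the trip count may not fill the last vector. */
static void kernel_generic(const int *in, int *out, size_t n) {
    size_t padded = ((n + WIDTH - 1) / WIDTH) * WIDTH;
    for (size_t gid = 0; gid < padded; ++gid)
        if (gid < n)                 /* potentially divergent branch */
            out[gid] = in[gid] * 2;
}

/* Specialized variant for the input-space partition n % WIDTH == 0:
 * the guard is provably uniform (always true), so it is dropped. */
static void kernel_specialized(const int *in, int *out, size_t n) {
    for (size_t gid = 0; gid < n; ++gid)   /* branch-free body */
        out[gid] = in[gid] * 2;
}

/* Driver-side dispatch: a cheap runtime test picks the specialized
 * kernel whenever the partition's precondition holds. */
static void launch(const int *in, int *out, size_t n) {
    if (n % WIDTH == 0)
        kernel_specialized(in, out, n);
    else
        kernel_generic(in, out, n);
}
```

Both variants compute the same result; the point is that on the partition `n % WIDTH == 0` the vectorizer sees no divergent control flow and can emit unmasked SIMD loads and stores.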
Index Terms: Input space splitting for OpenCL