ABSTRACT
We developed an OpenCL GPU kernel fusion library for Stan, a software package for Bayesian statistics. The library automatically combines kernels, optimizes computation, and is simple to use. Its practical value is that it speeds up the development of new GPU kernels while keeping the performance of automatically combined kernels comparable to that of hand-crafted kernels. We demonstrate this with experiments on basic operations and a linear regression model likelihood.