Abstract
We advocate the use of formal patterns and transformations for programming modern many-core processors such as Graphics Processing Units (GPUs), as an alternative to the low-level, ad hoc programming approaches currently in use, such as CUDA or OpenCL. Our new contribution is an intermediate level of low-level patterns that bridges the abstraction gap between popular high-level patterns (\(map\), fold/reduce, \(zip\), etc.) and imperative, executable code for many-cores. We define our low-level patterns based on the OpenCL programming model, which is portable across parallel architectures of different vendors, and we introduce semantics-preserving rewrite rules that transform programs with high-level patterns into programs with low-level patterns, from which executable OpenCL programs are automatically generated. We show that program design decisions and optimizations, which are usually applied ad hoc by experts, are systematically expressed in our approach as provably correct transformations for high- and low-level patterns. We evaluate our approach by systematically deriving several differently optimized OpenCL implementations of parallel reduction that achieve performance competitive with OpenCL programs manually written and highly tuned by performance experts.
Acknowledgments
This work was supported by the German Research Council (DFG) within the Cluster of Excellence CiM (University of Münster), by the German Ministry of Education and Research (BMBF) within the project HPC\(^2\)SE, and by a EuroLab-4-HPC collaboration. We thank Nvidia for their generous hardware donation used in our experiments.
Appendices
A Additional Rewrite Rules
B Proof of a Rewrite Rule
Rewrite rules are proved using equational reasoning. As an example, we prove rule (25), which introduces layers in the computation hierarchy of a reduction: first a partial reduction is computed, and then a second reduction combines all temporary results.
Proof (Reduce-Promotion Variant). Let n be a number divisible by m.
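The idea behind this rule can be illustrated with a minimal Python sketch. It is not the paper's formal notation and does not reproduce rule (25) itself: the helper names `split`, `part_red`, `reduce_direct`, and `reduce_layered` are hypothetical, and the sketch assumes that the operator is associative and that `identity` is its neutral element. Under these assumptions, reducing an input of length n directly gives the same result as first reducing each chunk of length m and then reducing the temporary results.

```python
from functools import reduce
import operator


def split(m, xs):
    """Split xs into consecutive chunks of length m (hypothetical helper).

    Mirrors the proof's precondition: len(xs) must be divisible by m.
    """
    if len(xs) % m != 0:
        raise ValueError("len(xs) must be divisible by m")
    return [xs[i:i + m] for i in range(0, len(xs), m)]


def part_red(op, identity, m, xs):
    """Partial reduction: reduce each chunk of length m to a single value."""
    return [reduce(op, chunk, identity) for chunk in split(m, xs)]


def reduce_direct(op, identity, xs):
    """Single-layer reduction over the whole input."""
    return reduce(op, xs, identity)


def reduce_layered(op, identity, m, xs):
    """Two-layer reduction: partial reduction, then a combining reduction."""
    return reduce(op, part_red(op, identity, m, xs), identity)


# n = 12 is divisible by m = 4; addition is associative with identity 0.
xs = list(range(1, 13))
assert reduce_direct(operator.add, 0, xs) == reduce_layered(operator.add, 0, 4, xs)
```

The check at the end only tests one input and is no substitute for the equational proof; it merely shows the two program shapes that the rule relates.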
C Derived Low-Level Reduction Programs
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Hagedorn, B., Steuwer, M., Gorlatch, S. (2018). A Transformation-Based Approach to Developing High-Performance GPU Programs. In: Petrenko, A., Voronkov, A. (eds) Perspectives of System Informatics. PSI 2017. Lecture Notes in Computer Science, vol 10742. Springer, Cham. https://doi.org/10.1007/978-3-319-74313-4_14
DOI: https://doi.org/10.1007/978-3-319-74313-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74312-7
Online ISBN: 978-3-319-74313-4
eBook Packages: Computer Science (R0)