ABSTRACT
We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads, and thereby exhibit poor performance on other inputs. We propose three different strategies for segmented reduction of regular arrays, each optimised for a particular workload. We demonstrate an implementation in the Futhark compiler that is able to employ all three strategies and automatically select the appropriate one at runtime. While our evaluation is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that has a need for segmented reductions.
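For intuition, the semantics of a regular segmented reduction, and the flavour of runtime strategy selection described above, can be sketched in Python. Note that `choose_strategy` is purely illustrative: its thresholds, strategy names, and `group_size` default are invented placeholders, not the paper's actual selection heuristics.

```python
from functools import reduce

def segmented_reduce(op, neutral, segments):
    """Reduce each segment of a regular segmented array with `op`.

    `segments` is a list of equal-length lists (a regular array), and
    `op` is an associative operator with neutral element `neutral`.
    Returns one result per segment: the reduction applied
    independently to every row.
    """
    return [reduce(op, seg, neutral) for seg in segments]

def choose_strategy(num_segments, segment_size, group_size=256):
    """Illustrative runtime dispatch between three strategies.

    A GPU implementation might pick a code version based on the shape
    of the input; the cutoffs below are hypothetical.
    """
    if segment_size >= group_size:
        return "large"   # whole workgroup(s) cooperate on one segment
    if num_segments >= group_size:
        return "small"   # many small segments packed per workgroup
    return "medium"      # in between: modest segment count and size
```

For example, `segmented_reduce(lambda a, b: a + b, 0, [[1, 2, 3], [4, 5, 6]])` yields `[6, 15]`, and a shape with very many short segments would dispatch to the "small" strategy in this sketch.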
We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two application benchmarks from the Rodinia suite. On the latter, we obtain speedups ranging from 1.3× to 1.7× over a previous implementation based on scans.
- Sean Baxter. 2013. Modern GPU 1.0. (2013). https://moderngpu.github.io/segreduce.html
- Lars Bergstrom and John Reppy. 2012. Nested Data-parallelism on the GPU. In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP ’12). ACM, New York, NY, USA, 247–258.
- Robert Bernecky and Sven-Bodo Scholz. 2015. Abstract Expressionism for Parallel Performance. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2015). ACM, New York, NY, USA, 54–59.
- Guy E. Blelloch. 1989. Scans as Primitive Parallel Operations. IEEE Transactions on Computers 38, 11 (1989), 1526–1538.
- Guy E. Blelloch. 1996. Programming Parallel Algorithms. Communications of the ACM (CACM) 39, 3 (1996), 85–97.
- Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In Procs. of the Sixth Workshop on Declarative Aspects of Multicore Programming. ACM, 3–14.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Procs. of IEEE Int. Symp. on Workload Characterization (IISWC). 44–54.
- Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. 2016. Low-level functional GPU programming for parallel algorithms. In Proceedings of the 5th International Workshop on Functional High-Performance Computing. ACM, 31–37.
- Jing Guo, Jeyarajan Thiyagalingam, and Sven-Bodo Scholz. 2011. Breaking the GPU Programming Barrier with the Auto-parallelising SAC Compiler. In Procs. Workshop Decl. Aspects of Multicore Prog. (DAMP). ACM, 15–24.
- Mark Harris et al. 2007. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology 2, 4 (2007).
- Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea. 2016. APL on GPUs: A TAIL from the Past, Scribbled in Futhark. In Procs. of the 5th Int. Workshop on Functional High-Performance Computing (FHPC’16). ACM, New York, NY, USA, 38–43.
- Troels Henriksen, Martin Elsman, and Cosmin E. Oancea. 2014. Size Slicing: A Hybrid Approach to Size Inference in Futhark. In Procs. of the 3rd ACM SIGPLAN Workshop on Functional High-Performance Computing (FHPC’14). ACM, New York, NY, USA, 31–42.
- Troels Henriksen, Ken Friis Larsen, and Cosmin E. Oancea. 2016. Design and GPGPU Performance of Futhark’s Redomap Construct. In Procs. of the 3rd ACM SIGPLAN Int. Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’16). ACM, New York, NY, USA, 17–24.
- Troels Henriksen and Cosmin E. Oancea. 2014. Bounds Checking: An Instance of Hybrid Analysis. In Procs. of ACM SIGPLAN Int. Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’14). ACM, New York, NY, USA, Article 88, 7 pages.
- Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-programming with Nested Parallelism and In-place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’17). ACM, New York, NY, USA.
- Jared Hoberock and Nathan Bell. 2016. Thrust: A Parallel Template Library. (2016). http://thrust.github.io/
- HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, and Kunle Olukotun. 2014. Locality-Aware Mapping of Nested Parallel Patterns on GPUs. In Procs. of the 47th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 63–74.
- Duane Merrill. 2017. CUB. (2017). https://github.com/NVlabs/cub
- John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable Parallel Programming with CUDA. Queue 6, 2 (March 2008), 40–53.
Index Terms
- Strategies for regular segmented reductions on GPU