ABSTRACT
We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads, and thereby exhibit poor performance on other inputs. We propose three different strategies for segmented reduction of regular arrays, each optimised for a particular workload. We demonstrate an implementation in the Futhark compiler that is able to employ all three strategies and automatically select the appropriate one at runtime. While our evaluation is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that has a need for segmented reductions.
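For intuition, the semantics of a regular segmented reduction, and the flavour of runtime strategy selection described above, can be sketched in Python. Note that `choose_strategy` is purely illustrative: its thresholds, strategy names, and `group_size` default are invented placeholders, not the paper's actual selection heuristics.

```python
from functools import reduce

def segmented_reduce(op, neutral, segments):
    """Reduce each segment of a regular segmented array with `op`.

    `segments` is a list of equal-length lists (a regular array), and
    `op` is an associative operator with neutral element `neutral`.
    Returns one result per segment: the reduction applied
    independently to every row.
    """
    return [reduce(op, seg, neutral) for seg in segments]

def choose_strategy(num_segments, segment_size, group_size=256):
    """Illustrative runtime dispatch between three strategies.

    A GPU implementation might pick a code version based on the shape
    of the input; the cutoffs below are hypothetical.
    """
    if segment_size >= group_size:
        return "large"   # whole workgroup(s) cooperate on one segment
    if num_segments >= group_size:
        return "small"   # many small segments packed per workgroup
    return "medium"      # in between: modest segment count and size
```

For example, `segmented_reduce(lambda a, b: a + b, 0, [[1, 2, 3], [4, 5, 6]])` yields `[6, 15]`, and a shape with very many short segments would dispatch to the "small" strategy in this sketch.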
We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two application benchmarks from the Rodinia suite. On the latter, we obtain speedups ranging from 1.3× to 1.7× over a previous implementation based on scans.
- Sean Baxter. 2013. Modern GPU 1.0. (2013). https://moderngpu.github.io/segreduce.html
- Lars Bergstrom and John Reppy. 2012. Nested Data-parallelism on the GPU. In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP ’12). ACM, New York, NY, USA, 247–258.
- Robert Bernecky and Sven-Bodo Scholz. 2015. Abstract Expressionism for Parallel Performance. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2015). ACM, New York, NY, USA, 54–59.
- Guy E. Blelloch. 1989. Scans as Primitive Parallel Operations. IEEE Transactions on Computers 38, 11 (1989), 1526–1538.
- Guy E. Blelloch. 1996. Programming Parallel Algorithms. Communications of the ACM (CACM) 39, 3 (1996), 85–97.
- Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In Procs. of the Sixth Workshop on Declarative Aspects of Multicore Programming. ACM, 3–14.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Procs. of IEEE Int. Symp. on Workload Characterization (IISWC). 44–54.
- Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. 2016. Low-level functional GPU programming for parallel algorithms. In Proceedings of the 5th International Workshop on Functional High-Performance Computing. ACM, 31–37.
- Jing Guo, Jeyarajan Thiyagalingam, and Sven-Bodo Scholz. 2011. Breaking the GPU Programming Barrier with the Auto-parallelising SAC Compiler. In Procs. Workshop Decl. Aspects of Multicore Prog. (DAMP). ACM, 15–24.
- Mark Harris et al. 2007. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology 2, 4 (2007).
- Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea. 2016. APL on GPUs: A TAIL from the Past, Scribbled in Futhark. In Procs. of the 5th Int. Workshop on Functional High-Performance Computing (FHPC’16). ACM, New York, NY, USA, 38–43.
- Troels Henriksen, Martin Elsman, and Cosmin E. Oancea. 2014. Size Slicing: A Hybrid Approach to Size Inference in Futhark. In Procs. of the 3rd ACM SIGPLAN Workshop on Functional High-Performance Computing (FHPC’14). ACM, New York, NY, USA, 31–42.
- Troels Henriksen, Ken Friis Larsen, and Cosmin E. Oancea. 2016. Design and GPGPU Performance of Futhark’s Redomap Construct. In Procs. of the 3rd ACM SIGPLAN Int. Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’16). ACM, New York, NY, USA, 17–24.
- Troels Henriksen and Cosmin E. Oancea. 2014. Bounds Checking: An Instance of Hybrid Analysis. In Procs. of ACM SIGPLAN Int. Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’14). ACM, New York, NY, USA, Article 88, 7 pages.
- Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-programming with Nested Parallelism and In-place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’17). ACM, New York, NY, USA.
- Jared Hoberock and Nathan Bell. 2016. Thrust: A Parallel Template Library. (2016). http://thrust.github.io/
- HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, and Kunle Olukotun. 2014. Locality-Aware Mapping of Nested Parallel Patterns on GPUs. In Procs. of the 47th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 63–74.
- Duane Merrill. 2017. CUB. (2017). https://github.com/NVlabs/cub
- John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable Parallel Programming with CUDA. Queue 6, 2 (March 2008), 40–53.
Index Terms
- Strategies for regular segmented reductions on GPU