DOI: 10.1145/3122948.3122952
Research article

Strategies for regular segmented reductions on GPU

Published: 07 September 2017

ABSTRACT

We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads, in which case they exhibit poor performance on other inputs. We propose three different strategies for segmented reduction of regular arrays, each optimised for a particular workload. We demonstrate an implementation in the Futhark compiler that is able to employ all three strategies and automatically select the appropriate one at runtime. While our evaluation is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that needs segmented reductions.

We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two application benchmarks from the Rodinia suite. On the latter, we obtain speedups ranging from 1.3× to 1.7× over a previous implementation based on scans.
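
To make the operation concrete, the sketch below expresses a regular segmented reduction in Futhark as a map of reductions: each row of a regular two-dimensional array is one segment, and every segment is reduced with the same associative operator and neutral element. This is an illustrative example only, not code from the paper; the function name segmented_sum and the choice of summation over 32-bit floats are ours. As the abstract states, the Futhark implementation can compile such a pattern using any of the three strategies and select among them at runtime.

    -- Hypothetical example: sum each row of a regular [n][m] array.
    -- Every row is a segment of the same length m, reduced with the
    -- associative operator (+) and neutral element 0.
    let segmented_sum [n][m] (xss: [n][m]f32): [n]f32 =
      map (\row -> reduce (+) 0f32 row) xss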

References

1. Sean Baxter. 2013. Modern GPU 1.0. (2013). https://moderngpu.github.io/segreduce.html
2. Lars Bergstrom and John Reppy. 2012. Nested Data-parallelism on the GPU. In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP ’12). ACM, New York, NY, USA, 247–258.
3. Robert Bernecky and Sven-Bodo Scholz. 2015. Abstract Expressionism for Parallel Performance. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2015). ACM, New York, NY, USA, 54–59.
4. Guy E. Blelloch. 1989. Scans as Primitive Parallel Operations. IEEE Transactions on Computers 38, 11 (1989), 1526–1538.
5. Guy E. Blelloch. 1996. Programming Parallel Algorithms. Communications of the ACM (CACM) 39, 3 (1996), 85–97.
6. Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell Array Codes with Multicore GPUs. In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming. ACM, 3–14.
7. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). 44–54.
8. Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. 2016. Low-level Functional GPU Programming for Parallel Algorithms. In Proceedings of the 5th International Workshop on Functional High-Performance Computing. ACM, 31–37.
9. Jing Guo, Jeyarajan Thiyagalingam, and Sven-Bodo Scholz. 2011. Breaking the GPU Programming Barrier with the Auto-parallelising SAC Compiler. In Proceedings of the Workshop on Declarative Aspects of Multicore Programming (DAMP). ACM, 15–24.
10. Mark Harris et al. 2007. Optimizing Parallel Reduction in CUDA. NVIDIA Developer Technology 2, 4 (2007).
11. Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea. 2016. APL on GPUs: A TAIL from the Past, Scribbled in Futhark. In Proceedings of the 5th International Workshop on Functional High-Performance Computing (FHPC ’16). ACM, New York, NY, USA, 38–43.
12. Troels Henriksen, Martin Elsman, and Cosmin E. Oancea. 2014. Size Slicing: A Hybrid Approach to Size Inference in Futhark. In Proceedings of the 3rd ACM SIGPLAN Workshop on Functional High-Performance Computing (FHPC ’14). ACM, New York, NY, USA, 31–42.
13. Troels Henriksen, Ken Friis Larsen, and Cosmin E. Oancea. 2016. Design and GPGPU Performance of Futhark’s Redomap Construct. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY ’16). ACM, New York, NY, USA, 17–24.
14. Troels Henriksen and Cosmin E. Oancea. 2014. Bounds Checking: An Instance of Hybrid Analysis. In Proceedings of the ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY ’14). ACM, New York, NY, USA, Article 88, 7 pages.
15. Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-programming with Nested Parallelism and In-place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’17). ACM, New York, NY, USA.
16. Jared Hoberock and Nathan Bell. 2016. Thrust: A Parallel Template Library. (2016). http://thrust.github.io/
17. HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, and Kunle Olukotun. 2014. Locality-Aware Mapping of Nested Parallel Patterns on GPUs. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 63–74.
18. Duane Merrill. 2017. CUB. (2017). https://github.com/NVlabs/cub
19. John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable Parallel Programming with CUDA. Queue 6, 2 (March 2008), 40–53.

Published in

FHPC 2017: Proceedings of the 6th ACM SIGPLAN International Workshop on Functional High-Performance Computing
September 2017, 52 pages
ISBN: 9781450351812
DOI: 10.1145/3122948
Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 18 of 25 submissions, 72%

