Modular array-based GPU computing in a dynamically-typed language

Authors: Matthias Springer, Peter Wauligmann, Hidehiko Masuhara

ARRAY 2017: Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming

Pages 48 - 55

https://doi.org/10.1145/3091966.3091974

Published: 18 June 2017

Abstract

GPU accelerators are now widely used in areas with large data-parallel computations, such as scientific computing and neural networks. Programmers can either write low-level CUDA/OpenCL code or, for better productivity, use a GPU extension for a high-level programming language. Most extensions focus on statically-typed languages, but many programmers prefer dynamically-typed languages for their simplicity and flexibility.

This paper shows how programmers can write high-level modular code in Ikra, a Ruby extension for array-based GPU computing. Programmers can compose GPU programs of multiple reusable parallel sections, which are subsequently fused into a small number of GPU kernels. We propose a seamless syntax for separating code regions that extensively use dynamic language features from those that are compiled for efficient execution. Moreover, we propose symbolic execution and a program analysis for kernel fusion to achieve performance that is close to hand-written CUDA code.
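
The idea of composing reusable parallel sections and fusing them can be illustrated with a small plain-Ruby sketch. It is not Ikra's API or implementation: the class `LazySection` and the methods `then_map`, `fused`, and `to_a` are hypothetical names chosen for illustration, and an ordinary Ruby loop stands in for a GPU kernel. Each section is only recorded when it is chained; when the result is requested, all recorded sections are composed into one function, so every element is visited once and no intermediate array is built.

```ruby
# Illustrative sketch only: hypothetical names, not Ikra's actual API.
# A plain Ruby loop stands in for a single fused GPU kernel.
class LazySection
  attr_reader :steps

  def initialize(size, steps = [])
    @size = size
    @steps = steps # one block per recorded parallel section
  end

  # Chain another element-wise section. Nothing is computed here;
  # the block is only recorded, so sections stay reusable.
  def then_map(&block)
    LazySection.new(@size, @steps + [block])
  end

  # "Fusion": compose all recorded sections into one function.
  def fused
    @steps.reduce(->(x) { x }) do |acc, f|
      ->(x) { f.call(acc.call(x)) }
    end
  end

  # Run the fused function once per index (one pass, no intermediates).
  def to_a
    kernel = fused
    Array.new(@size) { |i| kernel.call(i) }
  end
end

squares = LazySection.new(5).then_map { |i| i * i } # section 1
shifted = squares.then_map { |v| v + 1 }            # section 2, reuses section 1
result  = shifted.to_a
puts result.inspect # [1, 2, 5, 10, 17]
```

Because `then_map` returns a new object instead of mutating the receiver, `squares` can still be evaluated on its own or extended with a different second section. A real system would additionally need type inference and code generation to turn the fused function into GPU code.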

Published In

ARRAY 2017: Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming

June 2017

62 pages

ISBN: 9781450350693

DOI: 10.1145/3091966

General Chairs:
Martin Elsman, University of Copenhagen, Denmark
Clemens Grelck, University of Amsterdam
Andreas Kloeckner, Netherlands
David Padua, University of Illinois at Urbana-Champaign, USA
Edgar Solomonik, University of Illinois at Urbana-Champaign, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '17

Sponsor:

SIGPLAN

PLDI '17: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 18, 2017

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 17 of 25 submissions, 68%
