
Systematically extending a high-level code generator with support for tensor cores

Published: 18 May 2022 in GPGPU '22: Proceedings of the 14th Workshop on General Purpose Processing Using GPU. DOI: 10.1145/3530390.3532733

Abstract

High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, especially as the target hardware constantly evolves.
In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible through a systematic bottom-up approach: first, the imperative tensor core API is exposed to the code generator; then, the abstractions are raised to an internal low-level functional representation; finally, that representation is targeted by a rewrite process that starts from a high-level functional program. (A sketch of the imperative API at the bottom of this stack follows the abstract.)
Our experimental evaluation shows that RISE with support for tensor cores generates code whose performance is competitive with manually optimized CUDA code: it is at most 36%, and on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and it clearly outperforms any code that does not exploit tensor cores.
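The imperative interface mentioned in the abstract is, in CUDA, the warp-level WMMA API (nvcuda::wmma), in which a warp cooperatively loads 16x16 tiles into opaque "fragments", issues a tensor core matrix-multiply-accumulate, and stores the result. The following is a minimal sketch of that API, not code from the paper: it assumes half-precision inputs, a row-major A, a column-major B, dimensions that are multiples of 16, and an illustrative kernel name and warp-to-tile mapping.

    #include <mma.h>
    #include <cuda_fp16.h>

    using namespace nvcuda;

    // One warp computes one 16x16 tile of C += A * B using tensor cores.
    // A is row-major (M x K), B is column-major (K x N), C is row-major (M x N);
    // M, N, K are assumed to be multiples of 16.
    __global__ void wmmaGemmSketch(const half *a, const half *b, float *c,
                                   int M, int N, int K) {
      // Map each warp to one 16x16 output tile.
      int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
      int warpN = blockIdx.y * blockDim.y + threadIdx.y;
      if (warpM * 16 >= M || warpN * 16 >= N) return;

      // Opaque per-warp register tiles ("fragments").
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
      wmma::fill_fragment(cFrag, 0.0f);

      // Accumulate over the K dimension, one 16-wide slice at a time.
      for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, a + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, b + warpN * 16 * K + k, K);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // 16x16x16 tensor core MMA
      }

      // Write the accumulated 16x16 tile back to global memory.
      wmma::store_matrix_sync(c + warpM * 16 * N + warpN * 16, cFrag, N,
                              wmma::mem_row_major);
    }

All *_sync operations are warp-collective, so every thread of the warp must reach them; a launch would use a blockDim.x that is a multiple of warpSize, and the code requires compute capability 7.0 or newer (e.g. nvcc -arch=sm_70). Raising such warp-collective, fragment-based operations into a functional intermediate representation is the second step the abstract describes.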

