VkFFT and beyond - a platform for runtime GPU code generation

Published: 18 April 2023

ABSTRACT

This talk will present VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main reasons for this update are to make the algorithms implemented in VkFFT available to many other GPU applications and to standardize the way its code is generated.

The platform presented allows fine-tuning of algorithms at runtime for the particular GPU and API they are executed on. It aims to make it easier for competent GPU programmers to target different APIs, as the design logic of modern GPUs is fairly similar across vendors. This is the main difference between this platform and other existing API-independent ways to write code, which usually aim at fast prototyping and simple under-the-hood optimizations for beginner-level GPU programmers.

The platform has a hierarchical design: Application -> Plan -> Code. At the application stage, the platform handles all interaction with the user and all resource management: configuration parsing and calls to application initialization, update, dispatch and deletion, with optional binary caching. The plan stage is the internal configuration stage that constructs an intermediate representation of the problem to be solved; it covers all algorithmic decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string holding GPU code for a particular API, which can later be compiled and used. It is further divided into multiple levels: level 2 subkernels, a clear description of the problem as a sequence of calls to lower levels; level 1 subkernels, simple routines such as matrix-vector multiplication, FFT, pre- and post-processing, and R2C/R2R mappings; level 0 subkernels, covering memory management, basic math, function inlining and API-dependent definitions. The code generator operates on special data containers that can hold either integer/float values known during plan creation or strings of variable names. Taking a multiplication A = B * C as an example: if all containers hold known values, A can be precomputed during plan creation; if A, B and C are register names, we print a multiplication operation to the kernel to be executed at runtime.
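The container-based constant folding described above can be sketched as follows. This is a minimal illustration, not the platform's actual API: the `Container` layout, the `mul` signature and the temporary-name scheme are all hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the generator's data container: it holds either a
// value known during plan creation or the name of a register in the kernel.
struct Container {
    bool known;        // true: value resolved at plan time
    double value;      // used when known
    std::string name;  // used when not known (register/variable name)
};

Container knownVal(double v) { return {true, v, ""}; }
Container symbol(const std::string& n) { return {false, 0.0, n}; }

static int tempId = 0;  // counter for generated temporary names (illustrative)

// A = B * C: fold the product if both operands are known; otherwise append a
// multiplication statement to the kernel string being generated.
Container mul(const Container& b, const Container& c, std::string& kernel) {
    if (b.known && c.known)
        return knownVal(b.value * c.value);  // precomputed at plan creation
    auto operand = [](const Container& x) {
        return x.known ? std::to_string(x.value) : x.name;
    };
    Container a = symbol("t" + std::to_string(tempId++));
    kernel += a.name + " = " + operand(b) + " * " + operand(c) + ";\n";
    return a;
}
```

With two known operands nothing is emitted and the result is a plan-time constant; with a register-name operand the same call instead appends one line of GPU source to the kernel string.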

This talk will also discuss multiple algorithms implemented with this platform. Using VkFFT as an example, we will demonstrate the overall platform structure and general GPU application design guidelines, mainly related to memory-layout optimization: no CPU-GPU transfers during execution except asynchronous downloads from the GPU, minimized communication between dedicated GPU memory, L2 and L1, and maximized on-chip memory usage. Going further, we will demonstrate how a finite difference solver can be implemented with the help of the platform using only low-level warp shuffle instructions for on-chip data transfers, instead of the streaming multiprocessor's shared memory (on-chip memory accessible by all threads). This considerably reduces the number of communications between threads, which can be a performance-limiting factor for high-order schemes. We will present a benchmark comparison of warp communication performance on modern GPUs, including high-end HPC GPUs from Nvidia and AMD as well as consumer-level solutions.
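The shuffle-based stencil idea can be illustrated with a small CPU sketch. Each of 32 "lanes" keeps one grid value in a register, and a `shflSync` helper models `__shfl_sync`-style access to another lane's register, so the stencil's halo exchange needs no shared memory. The function names and the per-warp framing here are illustrative assumptions, not the platform's or CUDA's actual solver code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int kWarp = 32;  // lanes per warp on current Nvidia hardware

// Model of a warp shuffle: read the register held by another lane directly,
// bypassing shared memory entirely.
double shflSync(const std::array<double, kWarp>& regs, int srcLane) {
    return regs[srcLane & (kWarp - 1)];  // wrap the lane index, like a lane mask
}

// Second-order central difference du/dx on the interior lanes of one warp;
// boundary lanes are left at zero for simplicity.
std::array<double, kWarp> centralDiff(const std::array<double, kWarp>& u,
                                      double dx) {
    std::array<double, kWarp> d{};
    for (int lane = 1; lane < kWarp - 1; ++lane) {
        double left  = shflSync(u, lane - 1);  // neighbor's register, no smem
        double right = shflSync(u, lane + 1);
        d[lane] = (right - left) / (2.0 * dx);
    }
    return d;
}
```

On a GPU the loop body would run once per thread with `lane` being the thread's lane ID; a higher-order scheme simply issues more shuffles per lane, which is exactly where avoiding shared-memory round trips pays off.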

Published in
      IWOCL '23: Proceedings of the 2023 International Workshop on OpenCL
      April 2023
      133 pages
      ISBN:9798400707452
      DOI:10.1145/3585341

      Copyright © 2023 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • abstract
      • Research
      • Refereed limited

      Acceptance Rates

Overall Acceptance Rate: 84 of 152 submissions, 55%