poster

Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

Published: 21 February 2023

Abstract

We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
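The contrast drawn in the abstract can be made concrete with a small sketch. The Python below is not the authors' implementation. It only models the arithmetic of the two decompositions, assuming one output tile per worker per wave for the tile-based case and contiguous, near-equal iteration ranges for the work-centric case. The function names, the parameter names (`blk_m`, `blk_n`, `blk_k`, `num_pes`), and the problem sizes in the usage note are all hypothetical choices made for illustration.

```python
import math


def tile_based_utilization(m, n, blk_m, blk_n, num_pes):
    """Fraction of processing-element time slots that do useful work when
    each output tile is computed whole by one processing element (PE)."""
    tiles = math.ceil(m / blk_m) * math.ceil(n / blk_n)
    waves = math.ceil(tiles / num_pes)  # the last wave may be partly idle
    return tiles / (waves * num_pes)


def work_centric_ranges(m, n, k, blk_m, blk_n, blk_k, num_pes):
    """Split the aggregate inner-loop iterations of all tiles into num_pes
    contiguous half-open ranges whose lengths differ by at most one."""
    tiles = math.ceil(m / blk_m) * math.ceil(n / blk_n)
    iters_per_tile = math.ceil(k / blk_k)
    total_iters = tiles * iters_per_tile
    base, extra = divmod(total_iters, num_pes)
    ranges = []
    start = 0
    for pe in range(num_pes):
        count = base + (1 if pe < extra else 0)  # first `extra` PEs take one more
        ranges.append((start, start + count))
        start += count
    return ranges, iters_per_tile


def work_centric_utilization(ranges):
    """Useful iterations divided by the slots implied by the longest range."""
    lengths = [end - begin for begin, end in ranges]
    return sum(lengths) / (max(lengths) * len(lengths))
```

With illustrative sizes of m = n = 384, k = 512, 128 x 128 x 32 blocking, and 4 processing elements, the tile-based model gives 9 tiles in 3 waves, so `tile_based_utilization(384, 384, 128, 128, 4)` is 0.75, while `work_centric_ranges(384, 384, 512, 128, 128, 32, 4)` assigns each processing element 36 of the 144 iterations. A range may begin or end partway through a tile, so a real kernel would also have to combine partial sums across processing elements; that step is outside this sketch.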

    Published In

    PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
    February 2023
    480 pages
    ISBN:9798400700156
    DOI:10.1145/3572848
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. load-balancing
    3. matrix-multiplication

    Qualifiers

    • Poster

    Conference

    PPoPP '23

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 327
    • Downloads (last 6 weeks): 34
    Reflects downloads up to 16 Feb 2025

    Cited By

    • (2024) Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference. Proceedings of the 17th ACM International Systems and Storage Conference, 53-67. DOI: 10.1145/3688351.3689153. Online publication date: 16-Sep-2024
    • (2024) Research on GRAPES Semi-Implicit Semi-Lagrangian Computation Optimization Based on CPU+GPU Heterogeneity. 2024 6th International Conference on Electronics and Communication, Network and Computer Technology (ECNCT), 411-417. DOI: 10.1109/ECNCT63103.2024.10704385. Online publication date: 19-Jul-2024
    • (2024) A Framework for Fine-Grained Synchronization of Dependent GPU Kernels. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 93-105. DOI: 10.1109/CGO57630.2024.10444873. Online publication date: 2-Mar-2024
    • (2023) A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models. IEEE Computer Architecture Letters 22:2, 169-172. DOI: 10.1109/LCA.2023.3323482. Online publication date: 1-Jul-2023
