poster

Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

Authors:

John D. Owens

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

Pages 429 - 431

https://doi.org/10.1145/3572848.3577479

Published: 21 February 2023

Abstract

We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
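The quantization effect described in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' implementation. It assumes a hypothetical machine with `num_pes` processing elements, hypothetical blocking factors `blk_m`, `blk_n`, and `blk_k`, and a simplified cost model in which every inner-loop iteration costs the same. It compares a tile-based schedule, where each output tile is an indivisible unit of work, with a work-centric schedule that splits the aggregate inner-loop iterations evenly. It ignores the cost of combining partial sums when one output tile is shared by several processing elements.

```python
import math


def tile_based_utilization(m, n, blk_m, blk_n, num_pes):
    """Utilization when each output tile is one indivisible unit of work.

    Tiles are issued in waves of num_pes; a partially filled last wave
    leaves some processing elements idle.
    """
    tiles = math.ceil(m / blk_m) * math.ceil(n / blk_n)
    waves = math.ceil(tiles / num_pes)
    return tiles / (waves * num_pes)


def work_centric_ranges(m, n, k, blk_m, blk_n, blk_k, num_pes):
    """Assign each processing element a contiguous, near-equal range of the
    aggregate inner-loop iterations (tiles * iterations per tile).

    Returns a list of half-open (start, end) iteration ranges. A range may
    cross tile boundaries, so one tile can be shared by several elements.
    """
    tiles = math.ceil(m / blk_m) * math.ceil(n / blk_n)
    iters_per_tile = math.ceil(k / blk_k)
    total_iters = tiles * iters_per_tile
    base, extra = divmod(total_iters, num_pes)
    ranges, start = [], 0
    for pe in range(num_pes):
        # The first `extra` elements take one additional iteration.
        count = base + (1 if pe < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def work_centric_utilization(ranges):
    """Useful iterations divided by (longest range * number of elements)."""
    counts = [end - start for start, end in ranges]
    return sum(counts) / (max(counts) * len(counts))


# Hypothetical example: a 384 x 384 x 512 GEMM with 128 x 128 x 32 blocking
# on 4 processing elements. Nine output tiles need three waves of four.
print(tile_based_utilization(384, 384, 128, 128, 4))  # 0.75
ranges = work_centric_ranges(384, 384, 512, 128, 128, 32, 4)
print(ranges)  # [(0, 36), (36, 72), (72, 108), (108, 144)]
print(work_centric_utilization(ranges))  # 1.0
```

In this example the nine tiles do not divide evenly across four processing elements, so the tile-based schedule idles a quarter of the machine's capacity, while the even split of the 144 inner-loop iterations gives every element 36 iterations.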

Index Terms

Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms

Published In

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

February 2023

480 pages

ISBN:9798400700156

DOI:10.1145/3572848

General Chair: Maryam Mehri Dehnavi (University of Toronto)

Program Chairs: Milind Kulkarni (Purdue University), Sriram Krishnamoorthy (Google)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

PPoPP '23

PPoPP '23: The 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

February 25 - March 1, 2023

Montreal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
