DOI: 10.1145/3079079.3079086

Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU

Published: 14 June 2017

Abstract

The rising popularity of the graphics processing unit (GPU) across various numerical computing applications has triggered a breakneck race to optimize key numerical kernels, in particular the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade one aspect of performance against another: they either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer from load imbalance, or induce detrimental memory access patterns. In this paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits homogeneous behavior, reflected in efficient memory access patterns and a steady per-thread workload. From a local perspective, it avoids heterogeneous execution paths by adapting its behavior to the workload at hand, uses an efficient encoding to keep temporary data requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices, comparing against vendor-provided and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches that operate directly on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix, even when preprocessing time is discarded. Additionally, the same strategies lead to a significant performance increase when adapted for transpose SpMV.
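As background for the abstract's claim of working directly on CSR: a minimal sequential reference of SpMV over the CSR arrays. This is a sketch of the standard format only, not the paper's GPU algorithm; the array names (`values`, `col_idx`, `row_ptr`) follow common CSR conventions and are not taken from the paper.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for row in range(n):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# 3x3 example:  [[4, 0, 1],
#                [0, 2, 0],
#                [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

On a GPU, naively assigning one thread per row of this loop causes exactly the load imbalance and uncoalesced accesses the abstract describes, since rows have unequal nonzero counts; the paper's contribution is avoiding those effects without leaving the CSR format.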




    Published In

    ICS '17: Proceedings of the International Conference on Supercomputing
    June 2017, 300 pages
    ISBN: 9781450350204
    DOI: 10.1145/3079079
    General Chairs: William D. Gropp, Pete Beckman
    Program Chairs: Zhiyuan Li, Francisco J. Cazorla

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. GPU
    2. SpMV
    3. linear algebra
    4. sparse matrix

    Qualifiers

    • Research-article

    Conference

    ICS '17

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (Last 12 months)66
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 13 Feb 2025


    Cited By

    • (2025) LSSM-SpMM: A Long-Row Splitting and Short-Row Merging Approach for Parallel SpMM on PEZY-SC3s. Algorithms and Architectures for Parallel Processing, 10.1007/978-981-96-1551-3_7, 78-97. Online publication date: 17-Feb-2025.
    • (2025) Recursive Hybrid Compression for Sparse Matrix-Vector Multiplication on GPU. Concurrency and Computation: Practice and Experience 37:4-5, 10.1002/cpe.8366. Online publication date: 10-Feb-2025.
    • (2024) Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks. Electronics 13:20 (3981), 10.3390/electronics13203981. Online publication date: 10-Oct-2024.
    • (2024) A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 10.1145/3627535.3638470, 377-389. Online publication date: 2-Mar-2024.
    • (2024) Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 10.1145/3625549.3658683, 200-213. Online publication date: 3-Jun-2024.
    • (2024) DyLaClass: Dynamic Labeling Based Classification for Optimal Sparse Matrix Format Selection in Accelerating SpMV. IEEE Transactions on Parallel and Distributed Systems 35:12 (2624-2639), 10.1109/TPDS.2024.3488053. Online publication date: Dec-2024.
    • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing 185:C, 10.1016/j.jpdc.2023.104799. Online publication date: 4-Mar-2024.
    • (2023) Efficient Algorithm Design of Optimizing SpMV on GPU. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 10.1145/3588195.3593002, 115-128. Online publication date: 7-Aug-2023.
    • (2023) Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 10.1145/3582016.3582064, 18-32. Online publication date: 25-Mar-2023.
    • (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1145/3581784.3607051, 1-14. Online publication date: 12-Nov-2023.
