DOI: 10.1145/3079079.3079086

Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU

Published: 14 June 2017

Abstract

The rising popularity of the graphics processing unit (GPU) across various numerical computing applications has triggered a breakneck race to optimize key numerical kernels, in particular the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade one aspect of performance against another: they either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer from load imbalance, or induce detrimental memory access patterns. In this paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits homogeneous behavior, reflected in efficient memory access patterns and a steady per-thread workload. From a local perspective, it avoids heterogeneous execution paths by adapting its behavior to the workload at hand, uses an efficient encoding to keep temporary data requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices, comparing against vendor-provided and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches that operate directly on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix, even when preprocessing time is discarded. Additionally, the same strategies lead to a significant performance increase when adapted for transpose SpMV.
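As background for the abstract's claim of working directly on CSR: a minimal sequential reference of SpMV over the CSR arrays. This is a sketch of the standard format only, not the paper's GPU algorithm; the array names (`values`, `col_idx`, `row_ptr`) follow common CSR conventions and are not taken from the paper.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for row in range(n):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# 3x3 example:  [[4, 0, 1],
#                [0, 2, 0],
#                [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

On a GPU, naively assigning one thread per row of this loop causes exactly the load imbalance and uncoalesced accesses the abstract describes, since rows have unequal nonzero counts; the paper's contribution is avoiding those effects without leaving the CSR format.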




    Published In

    ICS '17: Proceedings of the International Conference on Supercomputing
    June 2017, 300 pages
    ISBN: 9781450350204
    DOI: 10.1145/3079079
    General Chairs: William D. Gropp, Pete Beckman
    Program Chairs: Zhiyuan Li, Francisco J. Cazorla

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. GPU
    2. SpMV
    3. linear algebra
    4. sparse matrix

    Qualifiers

    • Research-article

    Conference

    ICS '17

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (Last 12 months)66
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 13 Feb 2025


    Cited By

    • (2025) LSSM-SpMM: A Long-Row Splitting and Short-Row Merging Approach for Parallel SpMM on PEZY-SC3s. Algorithms and Architectures for Parallel Processing, 10.1007/978-981-96-1551-3_7, 78-97. Online publication date: 17-Feb-2025.
    • (2025) Recursive Hybrid Compression for Sparse Matrix-Vector Multiplication on GPU. Concurrency and Computation: Practice and Experience 37:4-5, 10.1002/cpe.8366. Online publication date: 10-Feb-2025.
    • (2024) Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks. Electronics 13:20 (3981), 10.3390/electronics13203981. Online publication date: 10-Oct-2024.
    • (2024) A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 10.1145/3627535.3638470, 377-389. Online publication date: 2-Mar-2024.
    • (2024) Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 10.1145/3625549.3658683, 200-213. Online publication date: 3-Jun-2024.
    • (2024) DyLaClass: Dynamic Labeling Based Classification for Optimal Sparse Matrix Format Selection in Accelerating SpMV. IEEE Transactions on Parallel and Distributed Systems 35:12 (2624-2639), 10.1109/TPDS.2024.3488053. Online publication date: Dec-2024.
    • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing 185:C, 10.1016/j.jpdc.2023.104799. Online publication date: 4-Mar-2024.
    • (2023) Efficient Algorithm Design of Optimizing SpMV on GPU. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 10.1145/3588195.3593002, 115-128. Online publication date: 7-Aug-2023.
    • (2023) Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 10.1145/3582016.3582064, 18-32. Online publication date: 25-Mar-2023.
    • (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 10.1145/3581784.3607051, 1-14. Online publication date: 12-Nov-2023.
