Load-balancing Sparse Matrix Vector Product Kernels on GPUs

Published: 29 March 2020

Abstract

Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL and the COO parts is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the SuiteSparse Matrix Collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
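To make the hybrid idea concrete, the following is a minimal CPU-side sketch in Python, not the article's GPU kernels. It splits a CSR matrix into a fixed-width ELL part and a COO part that holds the overflow of long rows, then computes the product from both parts. The function names (`choose_ell_width`, `split_hybrid`, `spmv_hybrid`) are hypothetical, and the quantile rule used to pick the ELL width is only an assumed stand-in for the theoretical analysis of the nonzeros-per-row distribution mentioned above.

```python
def choose_ell_width(row_ptr, quantile=0.5):
    """Assumed heuristic (not the article's analysis): pick the row length
    found at the given quantile of the sorted nonzeros-per-row values."""
    lengths = sorted(row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1))
    idx = min(len(lengths) - 1, int(quantile * len(lengths)))
    return lengths[idx]


def split_hybrid(row_ptr, col_idx, values, ell_width):
    """Store the first `ell_width` entries of every row in padded ELL arrays
    and move the remaining entries of long rows into a COO list."""
    n_rows = len(row_ptr) - 1
    # Padded slots use column 0 and value 0.0, so they add nothing to y.
    ell_cols = [[0] * ell_width for _ in range(n_rows)]
    ell_vals = [[0.0] * ell_width for _ in range(n_rows)]
    coo = []  # overflow entries as (row, col, value)
    for r in range(n_rows):
        for slot, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            if slot < ell_width:
                ell_cols[r][slot] = col_idx[k]
                ell_vals[r][slot] = values[k]
            else:
                coo.append((r, col_idx[k], values[k]))
    return ell_cols, ell_vals, coo


def spmv_hybrid(ell_cols, ell_vals, coo, x):
    """Compute y = A * x from the ELL part plus the COO part."""
    y = [0.0] * len(ell_cols)
    # ELL part: every row does the same amount of work (SIMD-friendly).
    for r in range(len(ell_cols)):
        for c, v in zip(ell_cols[r], ell_vals[r]):
            y[r] += v * x[c]
    # COO part: one update per stored entry, independent of row length.
    for r, c, v in coo:
        y[r] += v * x[c]
    return y


# 4x4 example with one long row (row 1 has four nonzeros).
row_ptr = [0, 1, 5, 6, 8]
col_idx = [0, 0, 1, 2, 3, 2, 1, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 2.0, 3.0, 4.0]

width = choose_ell_width(row_ptr)                      # 2 for this matrix
ell_cols, ell_vals, coo = split_hybrid(row_ptr, col_idx, values, width)
y = spmv_hybrid(ell_cols, ell_vals, coo, x)            # [1.0, 40.0, 18.0, 46.0]
```

In this example, storing everything in ELL would need a width of 4 and 8 padded slots. With width 2, only 2 slots are padded and the two overflow entries of row 1 go to the COO part. On a GPU, the COO updates would typically require atomic additions or a segmented reduction, which this sequential sketch does not model.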

Published In

ACM Transactions on Parallel Computing, Volume 7, Issue 1
Special Issue on Innovations in Systems for Irregular Applications, Part 1 and Regular Paper
March 2020
182 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3387354
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 March 2020
Accepted: 01 October 2019
Revised: 01 September 2019
Received: 01 December 2018
Published in TOPC Volume 7, Issue 1

Author Tags

  1. GPUs
  2. Sparse Matrix Vector Product (SpMV)
  3. irregular matrices

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 427
  • Downloads (Last 6 weeks): 57
Reflects downloads up to 17 Jan 2025

Cited By

  • (2024) Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems. ACM Transactions on Architecture and Code Optimization 21:4 (1-24). DOI: 10.1145/3676847. Online publication date: 8-Jul-2024
  • (2024) CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. Proceedings of the 53rd International Conference on Parallel Processing (640-649). DOI: 10.1145/3673038.3673042. Online publication date: 12-Aug-2024
  • (2024) FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs. IEEE Transactions on Parallel and Distributed Systems 35:12 (2423-2434). DOI: 10.1109/TPDS.2024.3477431. Online publication date: 1-Dec-2024
  • (2024) Shifting Between Compute and Memory Bounds: A Compression-Enabled Roofline Model. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (309-316). DOI: 10.1109/SCW63240.2024.00047. Online publication date: 17-Nov-2024
  • (2024) Accelerated Atomistic Kinetic Monte Carlo Simulations of Resistive Memory Arrays. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-16). DOI: 10.1109/SC41406.2024.00097. Online publication date: 17-Nov-2024
  • (2024) Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (1-16). DOI: 10.1109/SC41406.2024.00064. Online publication date: 17-Nov-2024
  • (2024) AmgT: Algebraic Multigrid Solver on Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-16). DOI: 10.1109/SC41406.2024.00058. Online publication date: 17-Nov-2024
  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing 185:C. DOI: 10.1016/j.jpdc.2023.104799. Online publication date: 4-Mar-2024
  • (2024) pSpMv: precision-based sparse matrix partition and SpMV optimization. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-024-00195-x. Online publication date: 16-Dec-2024
  • (2023) Compressed basis GMRES on high-performance graphics processing units. International Journal of High Performance Computing Applications 37:2 (82-100). DOI: 10.1177/10943420221115140. Online publication date: 1-Mar-2023
