DOI: 10.1145/3503221.3508431

TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs

Published: 28 March 2022

ABSTRACT

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks, and machine learning applications. Most existing parallel approaches for shared-memory SpGEMM use the row-row style, which can offer good parallelism. However, because of the irregularity of sparsity structures, existing row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and poor data locality, and (3) difficulty in selecting a suitable sparse accumulator.

In this paper, we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method used in dense general matrix-matrix multiplication (GEMM) and stores each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row, so the load imbalance issue is naturally alleviated. Secondly, the temporary space needed for each tile is small and always fits in on-chip scratchpad memory, so there is no need to allocate off-chip space for a large number of intermediate products, and data locality is much better. Thirdly, because the computations are restricted to a single tile, it is easier to select a fast sparse accumulator for that tile. Our experimental results on the two most recent NVIDIA GPUs show that TileSpGEMM outperforms four state-of-the-art SpGEMM methods, cuSPARSE, bhSPARSE, NSPARSE and spECK, on 139, 138, 127 and 94 of the 142 square matrices that require at least one billion flops for an SpGEMM operation, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups, respectively.
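
To make the tile-centric scheme concrete, the following is a minimal CPU-side C++ sketch of a tiled SpGEMM under assumed data structures: both inputs are partitioned into fixed-size sparse tiles stored in COO form, and every output tile is accumulated in a small dense buffer whose footprint is bounded by the tile size, the analogue of the on-chip scratchpad buffer discussed above. The tile size and all identifiers (SparseTile, TiledMatrix, tiled_spgemm) are illustrative assumptions, not TileSpGEMM's actual layout or kernel organization.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

constexpr int TILE = 16;  // assumed tile dimension (illustrative only)

// COO storage inside one fixed-size tile
struct SparseTile {
    std::vector<uint8_t> row, col;   // local indices in [0, TILE)
    std::vector<double>  val;
};

// tiled matrix: (tile-row, tile-col) -> non-empty sparse tile
using TiledMatrix = std::map<std::pair<int, int>, SparseTile>;

TiledMatrix tiled_spgemm(const TiledMatrix& A, const TiledMatrix& B) {
    // Index B's non-empty tiles by tile-row so that A(ti,tk) can find all B(tk,tj).
    std::map<int, std::vector<std::pair<int, const SparseTile*>>> b_by_row;
    for (const auto& [pos, tile] : B)
        b_by_row[pos.first].push_back({pos.second, &tile});

    // One small dense accumulator per output tile (TILE*TILE doubles),
    // standing in for the per-tile on-chip scratchpad buffer on the GPU.
    std::map<std::pair<int, int>, std::array<double, TILE * TILE>> acc;

    for (const auto& [posA, tA] : A) {
        const auto [ti, tk] = posA;
        auto it = b_by_row.find(tk);
        if (it == b_by_row.end()) continue;
        for (const auto& [tj, tB] : it->second) {
            auto& buf = acc[{ti, tj}];  // zero-initialized on first use
            // Naive pairwise accumulation for clarity; a real kernel would use
            // a hash- or bitmask-based sparse accumulator within the tile.
            for (std::size_t a = 0; a < tA.val.size(); ++a)
                for (std::size_t b = 0; b < tB->val.size(); ++b)
                    if (tA.col[a] == tB->row[b])
                        buf[tA.row[a] * TILE + tB->col[b]] += tA.val[a] * tB->val[b];
        }
    }

    // Compress each dense accumulator back into a sparse tile of C.
    TiledMatrix C;
    for (const auto& [pos, buf] : acc) {
        SparseTile t;
        for (int r = 0; r < TILE; ++r)
            for (int c = 0; c < TILE; ++c)
                if (buf[r * TILE + c] != 0.0) {
                    t.row.push_back(static_cast<uint8_t>(r));
                    t.col.push_back(static_cast<uint8_t>(c));
                    t.val.push_back(buf[r * TILE + c]);
                }
        if (!t.val.empty()) C[pos] = std::move(t);
    }
    return C;
}
```

The property the sketch is meant to preserve is that every accumulator is bounded by TILE*TILE entries, so on a GPU it can reside entirely in on-chip scratchpad memory regardless of the input's sparsity pattern.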



Published in
PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
April 2022, 495 pages
ISBN: 9781450392044
DOI: 10.1145/3503221
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

