DOI: 10.1145/3503221.3508431

TileSpGEMM: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs

Published: 28 March 2022

ABSTRACT

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks, and machine learning applications. Most existing parallel approaches for shared-memory SpGEMM use the row-row style, which can offer good parallelism. However, because of the irregularity of sparsity structures, existing row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and poor data locality, and (3) difficulty in selecting a suitable sparse accumulator.

In this paper, we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method used in dense general matrix-matrix multiplication (GEMM) and stores each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row, so the load imbalance issue is naturally alleviated. Secondly, the temporary space needed for each tile is small and always fits in on-chip scratchpad memory, so there is no need to allocate off-chip space for a large number of intermediate products, and data locality is much better. Thirdly, because the computations are restricted to a single tile, it is easier to select a fast sparse accumulator for that tile. Our experimental results on the two most recent NVIDIA GPUs show that TileSpGEMM outperforms four state-of-the-art SpGEMM methods, cuSPARSE, bhSPARSE, NSPARSE and spECK, on 139, 138, 127 and 94 of the 142 square matrices that require at least one billion flops for an SpGEMM operation, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups, respectively.
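
To make the tile-centric scheme concrete, the following is a minimal CPU-side C++ sketch of a tiled SpGEMM under assumed data structures: both inputs are partitioned into fixed-size sparse tiles stored in COO form, and every output tile is accumulated in a small dense buffer whose footprint is bounded by the tile size, the analogue of the on-chip scratchpad buffer discussed above. The tile size and all identifiers (SparseTile, TiledMatrix, tiled_spgemm) are illustrative assumptions, not TileSpGEMM's actual layout or kernel organization.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

constexpr int TILE = 16;  // assumed tile dimension (illustrative only)

// COO storage inside one fixed-size tile
struct SparseTile {
    std::vector<uint8_t> row, col;   // local indices in [0, TILE)
    std::vector<double>  val;
};

// tiled matrix: (tile-row, tile-col) -> non-empty sparse tile
using TiledMatrix = std::map<std::pair<int, int>, SparseTile>;

TiledMatrix tiled_spgemm(const TiledMatrix& A, const TiledMatrix& B) {
    // Index B's non-empty tiles by tile-row so that A(ti,tk) can find all B(tk,tj).
    std::map<int, std::vector<std::pair<int, const SparseTile*>>> b_by_row;
    for (const auto& [pos, tile] : B)
        b_by_row[pos.first].push_back({pos.second, &tile});

    // One small dense accumulator per output tile (TILE*TILE doubles),
    // standing in for the per-tile on-chip scratchpad buffer on the GPU.
    std::map<std::pair<int, int>, std::array<double, TILE * TILE>> acc;

    for (const auto& [posA, tA] : A) {
        const auto [ti, tk] = posA;
        auto it = b_by_row.find(tk);
        if (it == b_by_row.end()) continue;
        for (const auto& [tj, tB] : it->second) {
            auto& buf = acc[{ti, tj}];  // zero-initialized on first use
            // Naive pairwise accumulation for clarity; a real kernel would use
            // a hash- or bitmask-based sparse accumulator within the tile.
            for (std::size_t a = 0; a < tA.val.size(); ++a)
                for (std::size_t b = 0; b < tB->val.size(); ++b)
                    if (tA.col[a] == tB->row[b])
                        buf[tA.row[a] * TILE + tB->col[b]] += tA.val[a] * tB->val[b];
        }
    }

    // Compress each dense accumulator back into a sparse tile of C.
    TiledMatrix C;
    for (const auto& [pos, buf] : acc) {
        SparseTile t;
        for (int r = 0; r < TILE; ++r)
            for (int c = 0; c < TILE; ++c)
                if (buf[r * TILE + c] != 0.0) {
                    t.row.push_back(static_cast<uint8_t>(r));
                    t.col.push_back(static_cast<uint8_t>(c));
                    t.val.push_back(buf[r * TILE + c]);
                }
        if (!t.val.empty()) C[pos] = std::move(t);
    }
    return C;
}
```

The property the sketch is meant to preserve is that every accumulator is bounded by TILE*TILE entries, so on a GPU it can reside entirely in on-chip scratchpad memory regardless of the input's sparsity pattern.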



Published in
PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
April 2022, 495 pages
ISBN: 9781450392044
DOI: 10.1145/3503221
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

