DOI: 10.1145/3330345.3330354

IA-SpGEMM: an input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication

Published: 26 June 2019

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a sparse kernel used in a number of scientific applications. Although several SpGEMM algorithms have been proposed, almost all of them are restricted to the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. Which format and algorithm yield the best performance for SpGEMM also remains an open question.
In this work, we conduct a prospective study on format-specific parallel SpGEMM algorithms and analyze their pros and cons. We then propose IA-SpGEMM, an input-aware auto-tuning framework for SpGEMM that provides a unified programming interface in the CSR format and automatically determines the best format and algorithm for arbitrary sparse matrices. For this purpose, we set up an algorithm set and design a deep learning model called MatNet, trained on over 2,700 matrices from the SuiteSparse Matrix Collection, to quickly and accurately predict the best solution from sparse features and density representations. We evaluate our framework on CPUs and a GPU; the results show that IA-SpGEMM is on average 3.27x and 13.17x faster than MKL on an Intel and an AMD platform, respectively, and 2.23x faster than cuSPARSE on an NVIDIA GPU.
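For context, the CSR-restricted SpGEMM algorithms the abstract refers to generally descend from Gustavson's row-wise formulation: each row of C = A * B is formed by scaling and merging rows of B selected by the nonzeros of the corresponding row of A. A minimal pure-Python sketch of that idea (illustrative only, not the paper's tuned kernels; the hash-map accumulator is one of several accumulator choices such frameworks select among):

```python
# Gustavson-style row-wise SpGEMM on CSR inputs: C = A * B.
# Illustrative sketch only -- production kernels parallelize over rows
# and pick an accumulator (hash, heap, dense array) per the row's
# expected intermediate density.

def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        for k in range(a_ptr[i], a_ptr[i + 1]):
            a, col_a = a_val[k], a_idx[k]
            # merge a * (row col_a of B) into the accumulator
            for j in range(b_ptr[col_a], b_ptr[col_a + 1]):
                acc[b_idx[j]] = acc.get(b_idx[j], 0.0) + a * b_val[j]
        for col in sorted(acc):
            c_idx.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val

# Example: A = diag(1, 2), B = [[0, 3], [4, 0]] in CSR form.
ptr, idx, val = spgemm_csr([0, 1, 2], [0, 1], [1.0, 2.0],
                           [0, 1, 2], [1, 0], [3.0, 4.0], 2)
# ptr == [0, 1, 2], idx == [1, 0], val == [3.0, 8.0]
```

The irregularity visible here, where the work per row of C depends entirely on the input's sparsity pattern, is what makes the best format/algorithm choice input-dependent and motivates an auto-tuning approach.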



Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019, 533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. SpGEMM
  2. auto-tuning
  3. neural network
  4. sparse format

Qualifiers

  • Research-article

Conference

ICS '19

Acceptance Rates

Overall Acceptance Rate: 629 of 2,180 submissions, 29%

