Abstract
General sparse matrix–sparse matrix multiplication (SpGEMM) is a basic kernel in a great many applications, and several works have focused on its optimization. To fully exploit the powerful computing capability of the Sunway TaihuLight supercomputer for SpGEMM, this paper designs a partitioning method and a parallelization scheme for CSR-based SpGEMM that match the Sunway architecture well. In addition, the partitioning method is refined according to the distribution of floating-point operations in the CSR-based SpGEMM, which improves load balance and performance on the Sunway. We analyze the performance, including the memory footprint and the execution time, of both the parallel CSR-based SpGEMM and the optimized CSR-based SpGEMM on the Sunway. The experimental results show that the optimized CSR-based SpGEMM outperforms the parallel CSR-based SpGEMM and has good scalability on the Sunway.
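The two ideas the abstract describes — row-wise CSR-based SpGEMM and a partitioning of rows balanced by floating-point operation counts — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the dictionary accumulator, and the greedy contiguous partitioner are assumptions made for clarity.

```python
# Sketch of row-wise (Gustavson-style) CSR SpGEMM and a flop-balanced
# row partitioning. CSR matrices are given as (ptr, col, val) arrays.
# All names here are illustrative, not from the paper.

def spgemm_row(a_ptr, a_col, a_val, b_ptr, b_col, b_val, row):
    """Compute one output row C[row, :] = A[row, :] * B as {col: value}."""
    acc = {}
    for idx in range(a_ptr[row], a_ptr[row + 1]):
        k, av = a_col[idx], a_val[idx]          # nonzero A[row, k]
        for jdx in range(b_ptr[k], b_ptr[k + 1]):
            j = b_col[jdx]                      # nonzero B[k, j]
            acc[j] = acc.get(j, 0.0) + av * b_val[jdx]
    return acc

def flops_per_row(a_ptr, a_col, b_ptr):
    """flops(i) = 2 * sum over k in A's row i of nnz(B row k)."""
    return [2 * sum(b_ptr[k + 1] - b_ptr[k]
                    for k in a_col[a_ptr[i]:a_ptr[i + 1]])
            for i in range(len(a_ptr) - 1)]

def balanced_partition(flops, nparts):
    """Greedy contiguous split: close a part once it reaches the average."""
    target = sum(flops) / nparts
    parts, cur, acc = [], [], 0.0
    for i, f in enumerate(flops):
        cur.append(i)
        acc += f
        if acc >= target and len(parts) < nparts - 1:
            parts.append(cur)
            cur, acc = [], 0.0
    parts.append(cur)
    return parts
```

Each partition of rows can then be assigned to one group of compute elements; because parts carry roughly equal flop counts rather than equal row counts, skewed nonzero distributions no longer translate directly into load imbalance.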
Acknowledgements
The research was partially funded by the National Key R&D Program of China (Grant No. 2018YFB0203800), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International (Regional) Cooperation and Exchange Program of National Natural Science Foundation of China (Grant Nos. 61661146006, 61860206011), the Program of National Natural Science Foundation of China (Grant Nos. 61572175, 61806077), the Program of Hunan Provincial Innovation Foundation for Postgraduate (Grant No. CX2018B230), the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (Grant No. OCPC2017032), and the Fellowship Program of China Scholarship Council.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Chen, Y., Xiao, G. & Yang, W. Optimizing partitioned CSR-based SpGEMM on the Sunway TaihuLight. Neural Comput & Applic 32, 5571–5582 (2020). https://doi.org/10.1007/s00521-019-04121-z