Abstract
General sparse matrix–sparse matrix multiplication (SpGEMM) is a basic kernel in a great many applications, and several works have focused on its optimization. To fully exploit the powerful computing capability of the Sunway TaihuLight supercomputer for SpGEMM, this paper designs a partitioning method and a parallelization scheme for CSR-based SpGEMM that match the Sunway architecture well. In addition, the partitioning method is refined according to the distribution of floating-point operations in the CSR-based SpGEMM, which improves load balance and performance on the Sunway. We analyze the performance, including the memory footprint and the execution time, of both the parallel CSR-based SpGEMM and the optimized CSR-based SpGEMM on the Sunway. The experimental results show that the optimized CSR-based SpGEMM outperforms the parallel CSR-based SpGEMM and has good scalability on the Sunway.
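The two ideas the abstract describes — row-wise CSR-based SpGEMM and a partitioning of rows balanced by floating-point operation counts — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the dictionary accumulator, and the greedy contiguous partitioner are assumptions made for clarity.

```python
# Sketch of row-wise (Gustavson-style) CSR SpGEMM and a flop-balanced
# row partitioning. CSR matrices are given as (ptr, col, val) arrays.
# All names here are illustrative, not from the paper.

def spgemm_row(a_ptr, a_col, a_val, b_ptr, b_col, b_val, row):
    """Compute one output row C[row, :] = A[row, :] * B as {col: value}."""
    acc = {}
    for idx in range(a_ptr[row], a_ptr[row + 1]):
        k, av = a_col[idx], a_val[idx]          # nonzero A[row, k]
        for jdx in range(b_ptr[k], b_ptr[k + 1]):
            j = b_col[jdx]                      # nonzero B[k, j]
            acc[j] = acc.get(j, 0.0) + av * b_val[jdx]
    return acc

def flops_per_row(a_ptr, a_col, b_ptr):
    """flops(i) = 2 * sum over k in A's row i of nnz(B row k)."""
    return [2 * sum(b_ptr[k + 1] - b_ptr[k]
                    for k in a_col[a_ptr[i]:a_ptr[i + 1]])
            for i in range(len(a_ptr) - 1)]

def balanced_partition(flops, nparts):
    """Greedy contiguous split: close a part once it reaches the average."""
    target = sum(flops) / nparts
    parts, cur, acc = [], [], 0.0
    for i, f in enumerate(flops):
        cur.append(i)
        acc += f
        if acc >= target and len(parts) < nparts - 1:
            parts.append(cur)
            cur, acc = [], 0.0
    parts.append(cur)
    return parts
```

Each partition of rows can then be assigned to one group of compute elements; because parts carry roughly equal flop counts rather than equal row counts, skewed nonzero distributions no longer translate directly into load imbalance.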
Acknowledgements
The research was partially funded by the National Key R&D Program of China (Grant No. 2018YFB0203800), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International (Regional) Cooperation and Exchange Program of National Natural Science Foundation of China (Grant Nos. 61661146006, 61860206011), the Program of National Natural Science Foundation of China (Grant Nos. 61572175, 61806077), the Program of Hunan Provincial Innovation Foundation for Postgraduate (Grant No. CX2018B230), the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (Grant No. OCPC2017032), and the Fellowship Program of China Scholarship Council.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Chen, Y., Xiao, G. & Yang, W. Optimizing partitioned CSR-based SpGEMM on the Sunway TaihuLight. Neural Comput & Applic 32, 5571–5582 (2020). https://doi.org/10.1007/s00521-019-04121-z