swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight

Liu, Xiaoyan; Liu, Yi; Yin, Bohong; Yang, Hailong; Luan, Zhongzhi; Qian, Depei

doi:10.1007/s11704-022-1749-6

Research Article
Published: 07 November 2022

Volume 17, article number 174104, (2023)

Frontiers of Computer Science

Xiaoyan Liu^1,2, Yi Liu^2, Bohong Yin^2, Hailong Yang^1,2, Zhongzhi Luan^2 & Depei Qian^2

Abstract

Although matrix multiplication plays an essential role in a wide range of applications, previous work focuses only on optimizing dense or sparse matrix multiplication. Sparse Approximate Matrix Multiply (SpAMM) is an algorithm that accelerates the multiplication of decay matrices, whose sparsity lies between that of dense and sparse matrices. In addition, large-scale decay matrix multiplication is performed in scientific applications to solve cutting-edge problems. To optimize large-scale decay matrix multiplication using SpAMM on supercomputers such as Sunway Taihulight, we present swSpAMM, an optimized SpAMM algorithm that adapts its computation characteristics to the architectural features of Sunway Taihulight.
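To make the idea behind SpAMM concrete, the following is a minimal sketch of the norm-based pruning it relies on, written as a flat tiled loop rather than the recursive formulation. It is not the swSpAMM implementation: the function names (`frob_norm`, `spamm_blocked`, `decay_matrix`), the tile size `b`, the threshold `tau`, and the exponentially decaying test matrix are all assumptions chosen for illustration. A tile product is skipped whenever the product of the two tiles' Frobenius norms falls below `tau`.

```python
import math


def frob_norm(M, r0, c0, b):
    """Frobenius norm of the b x b tile of M with top-left corner (r0, c0)."""
    return math.sqrt(sum(M[r][c] ** 2
                         for r in range(r0, r0 + b)
                         for c in range(c0, c0 + b)))


def spamm_blocked(A, B, b, tau):
    """Approximate C = A * B for square list-of-lists matrices.

    The tile product A[i,k] * B[k,j] is skipped when
    ||A[i,k]||_F * ||B[k,j]||_F < tau (hypothetical threshold name).
    Returns (C, skipped_tile_products, total_tile_products).
    """
    n = len(A)
    assert n % b == 0, "matrix size must be a multiple of the tile size"
    nb = n // b
    norm_a = [[frob_norm(A, i * b, k * b, b) for k in range(nb)] for i in range(nb)]
    norm_b = [[frob_norm(B, k * b, j * b, b) for j in range(nb)] for k in range(nb)]
    C = [[0.0] * n for _ in range(n)]
    skipped = total = 0
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                total += 1
                if norm_a[i][k] * norm_b[k][j] < tau:
                    skipped += 1  # contribution is negligible, prune it
                    continue
                # Dense multiply of the two surviving tiles.
                for r in range(i * b, (i + 1) * b):
                    for kk in range(k * b, (k + 1) * b):
                        a = A[r][kk]
                        for c in range(j * b, (j + 1) * b):
                            C[r][c] += a * B[kk][c]
    return C, skipped, total


def decay_matrix(n, rate=0.5):
    """Toy decay matrix: entries shrink exponentially away from the diagonal."""
    return [[rate ** abs(i - j) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    A = decay_matrix(16)
    exact, _, total = spamm_blocked(A, A, 4, 0.0)      # tau = 0 prunes nothing
    approx, skipped, _ = spamm_blocked(A, A, 4, 1e-2)
    err = max(abs(x - y) for re, ra in zip(exact, approx) for x, y in zip(re, ra))
    print(f"skipped {skipped} of {total} tile products, max abs error {err:.2e}")
```

With `tau = 0` nothing is pruned and the result is the exact product. Raising `tau` trades accuracy for fewer tile multiplications, which is the property that makes decay matrices a good fit for this approach.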

Specifically, we propose both intra-node and inter-node optimizations to accelerate swSpAMM for large-scale execution. For intra-node optimizations, we explore algorithm parallelization and a block-major data layout that are tailored to better utilize the architectural advantages of the Sunway processor. For inter-node optimizations, we propose a matrix organization strategy for better distributing sub-matrices across nodes and a dynamic scheduling strategy for improving load balance across nodes. We compare swSpAMM with the existing GEMM library on a single node as well as with large-scale matrix multiplication methods on multiple nodes. The experimental results show that swSpAMM achieves speedups of up to 14.5× and 2.2× compared with the xMath library on a single node and the 2D GEMM method on multiple nodes, respectively.
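The abstract does not spell out the block-major layout, so the sketch below only illustrates the general technique under stated assumptions: a square row-major matrix is re-packed so that every b × b tile occupies one contiguous run of memory, which lets a tile be fetched in a single contiguous transfer. The helper names `to_block_major` and `block_major_index` and the tile size `b` are hypothetical and do not come from the paper.

```python
def to_block_major(M, b):
    """Flatten an n x n list-of-lists so that each b x b tile is contiguous.

    Tiles are ordered row by row; elements inside a tile are row-major.
    """
    n = len(M)
    assert n % b == 0, "matrix size must be a multiple of the tile size"
    flat = []
    for bi in range(0, n, b):          # tile row
        for bj in range(0, n, b):      # tile column
            for r in range(bi, bi + b):
                flat.extend(M[r][bj:bj + b])
    return flat


def block_major_index(i, j, n, b):
    """Offset of element (i, j) inside the block-major flat array."""
    tiles_per_row = n // b
    tile = (i // b) * tiles_per_row + (j // b)
    return tile * b * b + (i % b) * b + (j % b)


if __name__ == "__main__":
    n, b = 4, 2
    M = [[r * n + c for c in range(n)] for r in range(n)]
    flat = to_block_major(M, b)
    print(flat)
    # Every element is reachable through the index formula.
    assert all(flat[block_major_index(i, j, n, b)] == M[i][j]
               for i in range(n) for j in range(n))
```

In a tiled multiplication, each tile then starts at offset `tile * b * b` and spans `b * b` consecutive elements, instead of `b` separate strided rows as in a row-major layout.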

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), and the State Key Laboratory of Software Development Environment (SKLSDE-2021ZX-06).

Author information

Authors and Affiliations

State Key Laboratory of Software Development Environment, Beijing, 100191, China
Xiaoyan Liu & Hailong Yang
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Xiaoyan Liu, Yi Liu, Bohong Yin, Hailong Yang, Zhongzhi Luan & Depei Qian

Corresponding author

Correspondence to Hailong Yang.

Additional information

Xiaoyan Liu is a PhD student in the School of Computer Science and Engineering, Beihang University, China. She is currently working on performance optimization of scientific applications. Her research interests include HPC, approximate computation, and performance optimization.

Yi Liu is a professor in the School of Computer Science and Engineering and Director of the Sino-German Joint Software Institute (JSI) at Beihang University, China. In 2000, he completed his PhD in the Department of Computer Science at Xi’an Jiaotong University, China. His research interests include computer architecture, HPC, and new-generation network technology.

Bohong Yin is a master's student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on performance optimization for distributed systems. His research interests include HPC, performance optimization, and distributed communication.

Hailong Yang is an associate professor in the School of Computer Science and Engineering, Beihang University, China. He received his PhD degree from the School of Computer Science and Engineering, Beihang University, China, in 2014. His research interests include parallel and distributed computing, HPC, performance optimization, and energy efficiency.

Zhongzhi Luan received his PhD from the School of Computer Science at Xi’an Jiaotong University, China. He is an associate professor of Computer Science and Engineering at Beihang University, China. His research interests include distributed computing, parallel computing, grid computing, HPC, and new-generation network technology.

Depei Qian is a professor in the Department of Computer Science and Engineering, Beihang University, China. He received his master's degree from the University of North Texas, USA, in 1984. His research interests include innovative technologies in distributed computing, high performance computing, and computer architecture.