swSuperLU: A highly scalable sparse direct solver on Sunway manycore architecture

Tian, Min; Wang, Junjie; Zhang, Zanjun; Du, Wei; Pan, Jingshan; Liu, Tao

doi:10.1007/s11227-021-04270-w

Published: 11 February 2022

The Journal of Supercomputing, Volume 78, pages 11441–11463 (2022)

Min Tian¹, Junjie Wang², Zanjun Zhang¹,³, Wei Du¹, Jingshan Pan¹ & Tao Liu¹

Abstract

Sparse LU factorization is essential for scientific and engineering simulations. In this work, we present swSuperLU, a highly scalable sparse direct solver for the Sunway manycore architecture based on sparse LU factorization. To improve the parallelism of sparse LU factorization, we introduce a hierarchical scheme that exploits the hierarchy of the Sunway manycore architecture: process-level parallelism across the management processing elements (MPEs) and thread-level parallelism across the computing processing element (CPE) arrays. A task-based hierarchical scheme and a series of highly optimized computation kernels are designed to map processor loads and memory accesses onto this hierarchy. Moreover, we compare various ordering strategies and several machine-dependent parameter settings to find those most suitable for the Sunway manycore architecture. We present performance and scalability experiments of swSuperLU on the newest-generation Sunway supercomputer and on Sunway TaihuLight. swSuperLU achieves an average speedup of 9.02$\times$ over state-of-the-art packages and strong scalability from ten thousand cores to the million-core scale.
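
As a rough illustration of why the choice of ordering strategy matters for a sparse direct solver, the sketch below counts the fill-in (new nonzeros) created when a small sparsity pattern is eliminated in two different orders. It is not taken from swSuperLU: it assumes a symmetric pattern represented as an undirected graph, and the helper names `fill_in`, `arrow_graph`, and `min_degree_order` are hypothetical, chosen only for this example. The greedy minimum-degree rule stands in for the more refined orderings a production solver would use.

```python
def fill_in(adj, order):
    """Count fill edges created by symbolic elimination of a symmetric
    sparsity pattern (given as an adjacency dict) in the given order."""
    # Work on a copy so the caller's graph is left untouched.
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    fill = 0
    for v in order:
        nbrs = sorted(g.pop(v))
        for u in nbrs:
            g[u].discard(v)
        # Eliminating v couples all of its remaining neighbours pairwise.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill += 1
    return fill


def arrow_graph(n):
    """Pattern of an n-by-n 'arrow' matrix: vertex 0 touches every other."""
    adj = {v: set() for v in range(n)}
    for v in range(1, n):
        adj[0].add(v)
        adj[v].add(0)
    return adj


def min_degree_order(adj):
    """Greedy minimum-degree ordering, ties broken by vertex id."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while g:
        v = min(g, key=lambda x: (len(g[x]), x))
        nbrs = g.pop(v)
        for u in nbrs:
            g[u].discard(v)
        # Apply the same pairwise coupling so later degrees stay accurate.
        for a in nbrs:
            for b in nbrs:
                if a != b:
                    g[a].add(b)
        order.append(v)
    return order


adj = arrow_graph(6)
print(fill_in(adj, list(range(6))))         # dense vertex first: prints 10
print(fill_in(adj, min_degree_order(adj)))  # dense vertex deferred: prints 0
```

Eliminating the densely connected vertex first fills in the whole remaining block, while deferring it produces no fill at all. Differences of this kind in storage and arithmetic are what make a comparison of ordering strategies on a given machine worthwhile.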

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62002186), the Shandong Provincial Natural Science Foundation (Grant No. ZR2019PF015), the “Colleges and Universities 20 Terms” Foundation of Jinan City, China (2018GXRC015) and the Research and Application Demonstration of Key Technologies of Autonomous Controllable Supercomputing Software Ecosystem Project (2020KJC-ZD01).

Author information

Authors and Affiliations

Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong, China
Min Tian, Zanjun Zhang, Wei Du, Jingshan Pan & Tao Liu
School of Mathematics and Statistics, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong, China
Junjie Wang
Shaanxi Key Laboratory of Large Scale Electromagnetic Computing, Xidian University, Xi’an, Shaanxi, China
Zanjun Zhang

Corresponding author

Correspondence to Zanjun Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tian, M., Wang, J., Zhang, Z. et al. swSuperLU: A highly scalable sparse direct solver on Sunway manycore architecture. J Supercomput 78, 11441–11463 (2022). https://doi.org/10.1007/s11227-021-04270-w

Accepted: 20 December 2021
Published: 11 February 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s11227-021-04270-w
