Abstract
Sparse LU factorization is essential for scientific and engineering simulations. In this work, we present swSuperLU, a highly scalable sparse direct solver for the Sunway manycore architecture based on sparse LU factorization. To improve the parallelism of sparse LU factorization, we introduce a task-based hierarchical scheme that exploits the hierarchy of the Sunway manycore architecture through process-level parallelism between management processing elements (MPEs) and thread-level parallelism across the computing processing element (CPE) arrays. Together with a series of highly optimized computation kernels, this scheme maps processor loads and memory accesses onto that hierarchy. Moreover, we compare various ordering strategies and several machine-dependent parameter settings to identify those best suited to the Sunway manycore architecture. We present performance and scalability experiments with swSuperLU on the newest-generation Sunway supercomputer and on Sunway TaihuLight. swSuperLU achieves an average speedup of 9.02\(\times\) over state-of-the-art packages and exhibits strong scalability from ten thousand cores to a million cores.
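To illustrate why the choice of ordering strategy matters for a sparse direct solver, the following minimal sketch counts the fill-in produced by symbolic elimination on a small "arrow" pattern under two orderings. It is not the swSuperLU implementation: the function name `count_fill`, the example pattern, and the assumption of a symmetric nonzero structure are ours, chosen only for illustration.

```python
from itertools import combinations


def count_fill(n, edges, order):
    """Count fill edges created by symbolic elimination in the given order.

    Assumes a symmetric nonzero pattern, modeled as an undirected graph
    on vertices 0..n-1 (a hypothetical simplification for illustration).
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    fill = 0
    eliminated = set()
    for v in order:
        # Neighbors that are still present when v is eliminated.
        nbrs = [u for u in adj[v] if u not in eliminated]
        # Eliminating v couples all of its remaining neighbors pairwise.
        for a, b in combinations(nbrs, 2):
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                fill += 1
        eliminated.add(v)
    return fill


# Arrow pattern: vertex 0 (the "hub") is coupled to every other vertex.
n = 6
arrow_edges = [(0, j) for j in range(1, n)]

fill_hub_first = count_fill(n, arrow_edges, [0, 1, 2, 3, 4, 5])
fill_hub_last = count_fill(n, arrow_edges, [1, 2, 3, 4, 5, 0])
print(fill_hub_first, fill_hub_last)  # prints "10 0"
```

Eliminating the hub first turns the remaining five vertices into a clique (10 fill edges), whereas eliminating it last creates no fill at all. Fill-reducing orderings aim for the second behavior on general sparse patterns.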




Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62002186), the Shandong Provincial Natural Science Foundation (Grant No. ZR2019PF015), the “Colleges and Universities 20 Terms” Foundation of Jinan City, China (2018GXRC015) and the Research and Application Demonstration of Key Technologies of Autonomous Controllable Supercomputing Software Ecosystem Project (2020KJC-ZD01).
Cite this article
Tian, M., Wang, J., Zhang, Z. et al. swSuperLU: A highly scalable sparse direct solver on Sunway manycore architecture. J Supercomput 78, 11441–11463 (2022). https://doi.org/10.1007/s11227-021-04270-w
DOI: https://doi.org/10.1007/s11227-021-04270-w