ABSTRACT
Sparse direct solvers play a vital role in large-scale high performance computing in science and engineering. Existing distributed sparse direct methods employ multifrontal/supernodal patterns to aggregate columns of nearly identical forms and to exploit dense basic linear algebra subprograms (BLAS) for computation. However, such a data layout may bring more unevenness when the structure of the input matrix is not ideal, and using dense BLAS may waste many floating-point operations on zero fill-ins.
In this paper, we propose a new sparse direct solver called PanguLU. Unlike the multifrontal/supernodal layout, our work relies on simpler regular 2D blocking and stores the blocks in their sparse forms to avoid any extra fill-ins. Based on the sparse patterns of the blocks, a variety of block-wise sparse BLAS methods are developed and selected for higher efficiency on local GPUs. To make PanguLU more scalable, we also adjust the mapping of blocks to processes for overall more balanced workload, and propose a synchronisation-free communication strategy considering the dependencies among different sub-tasks to reduce overall latency overhead.
Experiments on two distributed heterogeneous platforms consisting of 128 NVIDIA A100 GPUs and 128 AMD MI50 GPUs demonstrate that PanguLU achieves up to 11.70x and 17.97x speedups over the latest SuperLU_DIST, and scales up to 47.51x and 74.84x on the 128 A100 and MI50 GPUs over a single GPU, respectively.
- https://www.top500.org/. 2023.Google Scholar
- E. Agullo, P. R. Amestoy, A. Buttari, A. Guermouche, J.-Y. L'Excellent, and F.-H. Rouet. Robust Memory-Aware Mappings for Parallel Multifrontal Factorizations. SIAM Journal on Scientific Computing, 38(3), 2016.Google Scholar
- E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov. LU Factorization for Accelerator-Based Systems. In AICCSA '11. IEEE, 2011.Google Scholar
- P. R. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker. Improving Multifrontal Methods by Means of Block Low-rank Representations. SIAM Journal on Scientific Computing, 37(3), 2015.Google ScholarDigital Library
- P. R. Amestoy, T. A. Davis, and I. S. Duff. An Approximate Minimum Degree Ordering Algorithm. SIAM Journal on Matrix Analysis and Applications, 17(4), 1996.Google Scholar
- P. R. Amestoy, T. A. Davis, and I. S. Duff. Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.Google Scholar
- P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1), 2001.Google ScholarDigital Library
- P. R. Amestoy, I. S. Duff, J.-Y. L' Excellent, and J. Koster. MUMPS: A General Purpose Distributed Memory Sparse Solver. In International Workshop on Applied Parallel Computing, 2000.Google Scholar
- P. R. Amestoy, I. S. Duff, and C. Vömel. Task Scheduling in an Asynchronous Distributed Memory Multifrontal Solver. SIAM Journal on Matrix Analysis and Applications, 26(2), 2004.Google ScholarDigital Library
- P. R. Amestoy and C. Puglisi. An Unsymmetrized Multifrontal LU Factorization. SIAM Journal on Matrix Analysis and Applications, 24(2), 2002.Google Scholar
- E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, and A. McKenney. LAPACK Users' Guide. SIAM, 1999.Google ScholarDigital Library
- S. Cayrols, I. S. Duff, and F. Lopez. Parallelization of the Solve Phase in A Task-based Cholesky Solver Using A Sequential Task Flow Model. The International Journal of High Performance Computing Applications, 34(3), 2020.Google ScholarDigital Library
- X. Chen, L. Ren, Y. Wang, and H. Yang. GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling. IEEE Transactions on Parallel and Distributed Systems, 26(3), 2015.Google Scholar
- X. Chen, Y. Wang, and H. Yang. An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation. In ASP-DAC '12, 2012.Google Scholar
- X. Chen, Y. Wang, and H. Yang. NICSLU: An Adaptive Sparse Matrix Solver For parallel Circuit Simulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(2), 2013.Google ScholarDigital Library
- X. Chen, Y. Wang, and H. Yang. A Fast Parallel Sparse Solver For SPICE-based Circuit Simulators. In DATE '15, 2015.Google ScholarCross Ref
- T. A. Davis. Algorithm 832: UMFPACK V4.3---an Unsymmetric-Pattern Multifrontal Method. ACM Transactions on Mathematical Software, 30(2), 2004.Google Scholar
- T. A. Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006.Google ScholarDigital Library
- T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. A Column Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.Google Scholar
- T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. Algorithm 836: COLAMD, A Column Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.Google Scholar
- T. A. Davis and W. W. Hager. Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves. ACM Transactions on Mathematical Software, 35(4), 2009.Google Scholar
- T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38(1), 2011.Google Scholar
- T. A. Davis and E. Palamadai Natarajan. Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems. ACM Transactions on Mathematical Software, 37(3), 2010.Google Scholar
- J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu. A Supernodal Approach to Sparse Partial Pivoting. SIAM Journal on Matrix Analysis and Applications, 20(3), 1999.Google Scholar
- J. W. Demmel, J. R. Gilbert, and X. S. Li. An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination. SIAM Journal on Matrix Analysis and Applications, 20(4), 1999.Google ScholarDigital Library
- N. Ding, Y. Liu, S. Williams, and X. S. Li. A Message-driven, Multi-GPU Parallel Sparse Triangular Solver. In ACDA21 '21, 2021.Google Scholar
- I. S. Duff. A Survey of Sparse Matrix Research. Proceedings of the IEEE, 65(4), 1977.Google Scholar
- I. S. Duff. Parallel Implementation of Multifrontal Schemes. Parallel computing, 3(3), 1986.Google Scholar
- I. S. Duff. Parallel Algorithms for Sparse Matrix Solution. Parallel Computing, 2020.Google ScholarCross Ref
- I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Inc, 2nd edition, 2017.Google ScholarCross Ref
- I. S. Duff, J. D. Hogg, and F. Lopez. A New Sparse Symmetric Indefinite Solver Using A Posteriori Threshold Pivoting. NLAFET Working Note, 2018.Google Scholar
- I. S. Duff and J. Koster. The Design and Use of Algorithms for Permuting Large Entries to the Diagonal of Sparse Matrices. SIAM Journal on Matrix Analysis and Applications, 20(4), 1999.Google ScholarDigital Library
- I. S. Duff and J. Koster. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix. SIAM Journal on Matrix Analysis and Applications, 22(4), 2001.Google ScholarDigital Library
- I. S. Duff, P. Leleux, D. Ruiz, and F. S. Torun. Improving the Scalability of the ABCD Solver with a Combination of New Load Balancing and Communication Minimization Techniques. In Parco '19, 2019.Google Scholar
- I. S. Duff and S. Pralet. Strategies for Scaling and Pivoting for Sparse Symmetric Indefinite Problems. SIAM Journal on Matrix Analysis and Applications, 27(2), 2005.Google Scholar
- I. S. Duff and J. K. Reid. The Multifrontal Solution of Indefinite Sparse Symmetric Linear. ACM Transactions on Mathematical Software, 9(3), 1983.Google Scholar
- I. S. Duff and J. K. Reid. MA48, a Fortran code for direct solution of sparse un-symmetric linear systems of equations. Science and Engineering Research Council, Rutherford Appleton Laboratory, 1993.Google Scholar
- I. S. Duff and J. K. Reid. The Design of MA48: A Code for The Direct Solution of Sparse Unsymmetric Linear Systems of Equations. ACM Transactions on Mathematical Software, 22(2), 1996.Google Scholar
- I. S. Duff and J. A. Scott. A Parallel Direct Solver for Large Sparse Highly Unsymmetric Linear Systems. ACM Transactions on Mathematical Software, 30(2), 2004.Google Scholar
- S. C. Eisenstat and J. W. H. Liu. Exploiting Structural Symmetry in Unsymmetric Sparse Symbolic Factorization. SIAM Journal on Matrix Analysis and Applications, 13(1), 1992.Google Scholar
- S. C. Eisenstat and J. W. H. Liu. Exploiting Structural Symmetry in a Sparse Partial Pivoting Code. SIAM Journal on Scientific Computing, 14(1), 1993.Google Scholar
- K. Eswar, P. Sadayappan, C.-H. Huang, and V. Visvanathan. Supernodal Sparse Cholesky Factorization on Distributed-memory Multiprocessors. In ICPP '93, volume 3, 1993.Google Scholar
- K. Eswar, P. Sadayappan, and V. Visvanathan. Multifrontal Factorization of Sparse Matrices on Shared-Memory Multiprocessors. In ICPP '91, 1991.Google Scholar
- A. Gaihre, X. S. Li, and H. Liu. gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(4), 2022.Google Scholar
- L. Grigori, J. W. Demmel, and X. S. Li. Parallel Symbolic Factorization for Sparse LU with Static Pivoting. SIAM Journal on Scientific Computing, 29(3), 2007.Google ScholarCross Ref
- L. Grigori, J. W. Demmel, and H. Xiang. CALU: A Communication Optimal LU Factorization Algorithm. SIAM Journal on Matrix Analysis and Applications, 32(4), 2011.Google Scholar
- A. Gupta. WSMP: Watson Sparse Matrix Package (Part-I: Direct Solution of Symmetric Sparse Systems). IBM TJ Watson Research Center, Yorktown Heights, NY, 2000.Google Scholar
- A. Gupta. WSMP: Watson Sparse Matrix Package (Part-II: Direct Solution of General Sparse Systems). IBM TJ Watson Research Center, Yorktown Heights, NY, 2000.Google Scholar
- K. He, S. X.-D. Tan, H. Wang, and G. Shi. GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis. IEEE Transactions on Very Large Scale Integration Systems, 24(3), 2015.Google Scholar
- G. Karypis and V. Kumar. METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-reducing Orderings of Sparse Matrices. 1997.Google Scholar
- B. Kumar, P. Sadayappan, and C.-H. Huang. On Sparse Matrix Reordering for Parallel Factorization. In ICS '94, 1994.Google Scholar
- W.-K. Lee, R. Achar, and M. S. Nakhla. Dynamic GPU Parallel Sparse LU Factorization for Fast Circuit Simulation. IEEE Transactions on Very Large Scale Integration Systems, 26(11), 2018.Google Scholar
- X. S. Li. An Overview of SuperLU: Algorithms, Implementation, and User Interface. ACM Transactions on Mathematical Software, 31(3), 2005.Google Scholar
- X. S. Li and J. W. Demmel. Making Sparse Gaussian Elimination Scalable by Static Pivoting. In SC '98, 1998.Google Scholar
- X. S. Li and J. W. Demmel. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Transactions on Mathematical Software, 29(2), 2003.Google ScholarDigital Library
- X. S. Li, P. Lin, Y. Liu, and P. Sao. Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver. ACM Transactions on Mathematical Software, 49(1), 2023.Google ScholarDigital Library
- J. W. H. Liu. The Role of Elimination Trees in Sparse Factorization. SIAM Journal on Matrix Analysis and Applications, 11(1), 1990.Google Scholar
- W. Liu, A. Li, J. D. Hogg, I. S. Duff, and B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In Euro-Par '16, 2016.Google ScholarDigital Library
- Y. Liu, N. Ding, P. Sao, S. Williams, and X. S. Li. Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters. In SC '23, 2023.Google Scholar
- Y. Liu, M. Jacquelin, P. Ghysels, and X. S. Li. Highly Scalable Distributed-memory Sparse Triangular Solution Algorithms. In CSC '18, 2018.Google Scholar
- S. Peng and S. X.-D. Tan. GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation. IEEE Design & Test, 37(3), 2020.Google ScholarCross Ref
- L. Ren, X. Chen, Y. Wang, C. Zhang, and H. Yang. Sparse LU Factorization for Parallel Circuit Simulation on GPU. In DAC '12, 2012.Google Scholar
- P. Sadayappan and S. K. Rao. Communication Reduction for Distributed Sparse Matrix Factorization on a Processor Mesh. In SC '89, 1989.Google ScholarDigital Library
- P. Sadayappan and V. Visvanathan. Efficient Sparse Matrix Factorization for Circuit Simulation On Vector Supercomputers. In DAC '98, 1989.Google Scholar
- P. Sao, R. Kannan, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D Sparse Triangular Solver. In ICS '19, 2019.Google ScholarDigital Library
- P. Sao, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices. In IPDPS '18, 2018.Google ScholarCross Ref
- P. Sao, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D Algorithm for Sparse LU Factorization on Heterogeneous Systems. Journal of Parallel and Distributed Computing, 131, 2019.Google Scholar
- P. Sao, X. Liu, R. Vuduc, and X. S. Li. A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems. In IPDPS '15, 2015.Google ScholarDigital Library
- P. Sao, R. Vuduc, and X. S. Li. A Distributed CPU-GPU Sparse Direct Solver. In Euro-Par '14, 2014.Google ScholarCross Ref
- O. Schenk and K. Gärtner. Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO. Future Generation Computer Systems, 20(3), 2004.Google Scholar
- O. Schenk and K. Gärtner. Two-level Dynamic Scheduling in PARDISO: Improved scalability on Shared memory Multiprocessing Systems. Parallel Computing, 28(2), 2002.Google Scholar
- O. Schenk, K. Gärtner, and W. Fichtner. Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors. BIT Numerical Mathematics, 40, 2000.Google Scholar
- O. Schenk, K. Gärtner, W. Fichtner, and A. Stricker. PARDISO: a High-Performance Serial and Parallel Sparse Linear Solver in Semiconductor Device Simulation. Future Generation Computer Systems, 18(1), 2001.Google Scholar
- G. Tan, C. Shui, Y. Wang, X. Yu, and Y. Yan. Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems. IEEE Transactions on Parallel and Distributed Systems, 32(9), 2021.Google ScholarDigital Library
- M. Tian, J. Wang, Z. Zhang, W. Du, J. Pan, and T. Liu. swSuperLU: A Highly Scalable Sparse Direct Solver on Sunway Manycore Architecture. The Journal of Supercomputing, 78(9), 2022.Google Scholar
- T. Wang, W. Li, H. Pei, Y. Sun, Z. Jin, and W. Liu. Accelerating Sparse LU Factorization with Density-Aware Adaptive Matrix Multiplication for Circuit Simulation. In DAC '23, 2023.Google Scholar
- J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li. Superfast Multifrontal Method for Large Structured Linear Systems of Equations. SIAM Journal on Matrix Analysis and Applications, 31(3), 2010.Google ScholarCross Ref
- Y. Xia, P. Jiang, G. Agrawal, and R. Ramnath. End-to-End LU Factorization of Large Matrices on GPUs. In PPoPP '23, 2023.Google Scholar
- Z. Yan, B. Xie, X. Li, and Y. Bao. Exploiting Architecture Advances for Sparse Solvers in Circuit Simulation. In DATE '22, 2022.Google Scholar
- C. Yu, W. Wang, and D. Pierce. A CPU-GPU Hybrid Approach for the Unsymmetric Multifrontal Method. Parallel Computing, 37(12), 2011.Google Scholar
- J. Zhao, Y. Wen, Y. Luo, Z. Jin, W. Liu, and Z. Zhou. SFLU: Synchronization-Free Sparse LU Factorization for Fast Circuit Simulation on GPUs. In DAC '21, 2021.Google Scholar
Index Terms
- PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems
Recommendations
A Sparse Symmetric Indefinite Direct Solver for GPU Architectures
In recent years, there has been considerable interest in the potential for graphics processing units (GPUs) to speed up the performance of sparse direct linear solvers. Efforts have focused on symmetric positive-definite systems for which no pivoting is ...
swSuperLU: A highly scalable sparse direct solver on Sunway manycore architecture
AbstractSparse LU factorization is essential for scientific and engineering simulations. In this work, we present swSuperLU, a highly scalable sparse direct solver on Sunway manycore architecture based on sparse LU factorization. To improve the ...
A direct method to solve block banded block Toeplitz systems with non-banded Toeplitz blocks
A fast solution algorithm is proposed for solving block banded block Toeplitz systems with non-banded Toeplitz blocks. The algorithm constructs the circulant transformation of a given Toeplitz system and then by means of the Sherman-Morrison-Woodbury ...
Comments