skip to main content
10.1145/3581784.3607050acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections

PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems

Authors Info & Claims
Published:11 November 2023Publication History

ABSTRACT

Sparse direct solvers play a vital role in large-scale high performance computing in science and engineering. Existing distributed sparse direct methods employ multifrontal/supernodal patterns to aggregate columns of nearly identical forms and to exploit dense basic linear algebra subprograms (BLAS) for computation. However, such a data layout may bring more unevenness when the structure of the input matrix is not ideal, and using dense BLAS may waste many floating-point operations on zero fill-ins.

In this paper, we propose a new sparse direct solver called PanguLU. Unlike the multifrontal/supernodal layout, our work relies on simpler regular 2D blocking and stores the blocks in their sparse forms to avoid any extra fill-ins. Based on the sparse patterns of the blocks, a variety of block-wise sparse BLAS methods are developed and selected for higher efficiency on local GPUs. To make PanguLU more scalable, we also adjust the mapping of blocks to processes for overall more balanced workload, and propose a synchronisation-free communication strategy considering the dependencies among different sub-tasks to reduce overall latency overhead.

Experiments on two distributed heterogeneous platforms consisting of 128 NVIDIA A100 GPUs and 128 AMD MI50 GPUs demonstrate that PanguLU achieves up to 11.70x and 17.97x speedups over the latest SuperLU_DIST, and scales up to 47.51x and 74.84x on the 128 A100 and MI50 GPUs over a single GPU, respectively.

References

  1. https://www.top500.org/. 2023.Google ScholarGoogle Scholar
  2. E. Agullo, P. R. Amestoy, A. Buttari, A. Guermouche, J.-Y. L'Excellent, and F.-H. Rouet. Robust Memory-Aware Mappings for Parallel Multifrontal Factorizations. SIAM Journal on Scientific Computing, 38(3), 2016.Google ScholarGoogle Scholar
  3. E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov. LU Factorization for Accelerator-Based Systems. In AICCSA '11. IEEE, 2011.Google ScholarGoogle Scholar
  4. P. R. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker. Improving Multifrontal Methods by Means of Block Low-rank Representations. SIAM Journal on Scientific Computing, 37(3), 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. R. Amestoy, T. A. Davis, and I. S. Duff. An Approximate Minimum Degree Ordering Algorithm. SIAM Journal on Matrix Analysis and Applications, 17(4), 1996.Google ScholarGoogle Scholar
  6. P. R. Amestoy, T. A. Davis, and I. S. Duff. Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.Google ScholarGoogle Scholar
  7. P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1), 2001.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. R. Amestoy, I. S. Duff, J.-Y. L' Excellent, and J. Koster. MUMPS: A General Purpose Distributed Memory Sparse Solver. In International Workshop on Applied Parallel Computing, 2000.Google ScholarGoogle Scholar
  9. P. R. Amestoy, I. S. Duff, and C. Vömel. Task Scheduling in an Asynchronous Distributed Memory Multifrontal Solver. SIAM Journal on Matrix Analysis and Applications, 26(2), 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. R. Amestoy and C. Puglisi. An Unsymmetrized Multifrontal LU Factorization. SIAM Journal on Matrix Analysis and Applications, 24(2), 2002.Google ScholarGoogle Scholar
  11. E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, and A. McKenney. LAPACK Users' Guide. SIAM, 1999.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Cayrols, I. S. Duff, and F. Lopez. Parallelization of the Solve Phase in A Task-based Cholesky Solver Using A Sequential Task Flow Model. The International Journal of High Performance Computing Applications, 34(3), 2020.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. X. Chen, L. Ren, Y. Wang, and H. Yang. GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling. IEEE Transactions on Parallel and Distributed Systems, 26(3), 2015.Google ScholarGoogle Scholar
  14. X. Chen, Y. Wang, and H. Yang. An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation. In ASP-DAC '12, 2012.Google ScholarGoogle Scholar
  15. X. Chen, Y. Wang, and H. Yang. NICSLU: An Adaptive Sparse Matrix Solver For parallel Circuit Simulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(2), 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. X. Chen, Y. Wang, and H. Yang. A Fast Parallel Sparse Solver For SPICE-based Circuit Simulators. In DATE '15, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  17. T. A. Davis. Algorithm 832: UMFPACK V4.3---an Unsymmetric-Pattern Multifrontal Method. ACM Transactions on Mathematical Software, 30(2), 2004.Google ScholarGoogle Scholar
  18. T. A. Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. A Column Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.Google ScholarGoogle Scholar
  20. T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. Algorithm 836: COLAMD, A Column Approximate Minimum Degree Ordering Algorithm. ACM Transactions on Mathematical Software, 30(3), 2004.Google ScholarGoogle Scholar
  21. T. A. Davis and W. W. Hager. Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves. ACM Transactions on Mathematical Software, 35(4), 2009.Google ScholarGoogle Scholar
  22. T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38(1), 2011.Google ScholarGoogle Scholar
  23. T. A. Davis and E. Palamadai Natarajan. Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems. ACM Transactions on Mathematical Software, 37(3), 2010.Google ScholarGoogle Scholar
  24. J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu. A Supernodal Approach to Sparse Partial Pivoting. SIAM Journal on Matrix Analysis and Applications, 20(3), 1999.Google ScholarGoogle Scholar
  25. J. W. Demmel, J. R. Gilbert, and X. S. Li. An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination. SIAM Journal on Matrix Analysis and Applications, 20(4), 1999.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. N. Ding, Y. Liu, S. Williams, and X. S. Li. A Message-driven, Multi-GPU Parallel Sparse Triangular Solver. In ACDA21 '21, 2021.Google ScholarGoogle Scholar
  27. I. S. Duff. A Survey of Sparse Matrix Research. Proceedings of the IEEE, 65(4), 1977.Google ScholarGoogle Scholar
  28. I. S. Duff. Parallel Implementation of Multifrontal Schemes. Parallel computing, 3(3), 1986.Google ScholarGoogle Scholar
  29. I. S. Duff. Parallel Algorithms for Sparse Matrix Solution. Parallel Computing, 2020.Google ScholarGoogle ScholarCross RefCross Ref
  30. I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, Inc, 2nd edition, 2017.Google ScholarGoogle ScholarCross RefCross Ref
  31. I. S. Duff, J. D. Hogg, and F. Lopez. A New Sparse Symmetric Indefinite Solver Using A Posteriori Threshold Pivoting. NLAFET Working Note, 2018.Google ScholarGoogle Scholar
  32. I. S. Duff and J. Koster. The Design and Use of Algorithms for Permuting Large Entries to the Diagonal of Sparse Matrices. SIAM Journal on Matrix Analysis and Applications, 20(4), 1999.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. I. S. Duff and J. Koster. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix. SIAM Journal on Matrix Analysis and Applications, 22(4), 2001.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. I. S. Duff, P. Leleux, D. Ruiz, and F. S. Torun. Improving the Scalability of the ABCD Solver with a Combination of New Load Balancing and Communication Minimization Techniques. In Parco '19, 2019.Google ScholarGoogle Scholar
  35. I. S. Duff and S. Pralet. Strategies for Scaling and Pivoting for Sparse Symmetric Indefinite Problems. SIAM Journal on Matrix Analysis and Applications, 27(2), 2005.Google ScholarGoogle Scholar
  36. I. S. Duff and J. K. Reid. The Multifrontal Solution of Indefinite Sparse Symmetric Linear. ACM Transactions on Mathematical Software, 9(3), 1983.Google ScholarGoogle Scholar
  37. I. S. Duff and J. K. Reid. MA48, a Fortran code for direct solution of sparse un-symmetric linear systems of equations. Science and Engineering Research Council, Rutherford Appleton Laboratory, 1993.Google ScholarGoogle Scholar
  38. I. S. Duff and J. K. Reid. The Design of MA48: A Code for The Direct Solution of Sparse Unsymmetric Linear Systems of Equations. ACM Transactions on Mathematical Software, 22(2), 1996.Google ScholarGoogle Scholar
  39. I. S. Duff and J. A. Scott. A Parallel Direct Solver for Large Sparse Highly Unsymmetric Linear Systems. ACM Transactions on Mathematical Software, 30(2), 2004.Google ScholarGoogle Scholar
  40. S. C. Eisenstat and J. W. H. Liu. Exploiting Structural Symmetry in Unsymmetric Sparse Symbolic Factorization. SIAM Journal on Matrix Analysis and Applications, 13(1), 1992.Google ScholarGoogle Scholar
  41. S. C. Eisenstat and J. W. H. Liu. Exploiting Structural Symmetry in a Sparse Partial Pivoting Code. SIAM Journal on Scientific Computing, 14(1), 1993.Google ScholarGoogle Scholar
  42. K. Eswar, P. Sadayappan, C.-H. Huang, and V. Visvanathan. Supernodal Sparse Cholesky Factorization on Distributed-memory Multiprocessors. In ICPP '93, volume 3, 1993.Google ScholarGoogle Scholar
  43. K. Eswar, P. Sadayappan, and V. Visvanathan. Multifrontal Factorization of Sparse Matrices on Shared-Memory Multiprocessors. In ICPP '91, 1991.Google ScholarGoogle Scholar
  44. A. Gaihre, X. S. Li, and H. Liu. gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(4), 2022.Google ScholarGoogle Scholar
  45. L. Grigori, J. W. Demmel, and X. S. Li. Parallel Symbolic Factorization for Sparse LU with Static Pivoting. SIAM Journal on Scientific Computing, 29(3), 2007.Google ScholarGoogle ScholarCross RefCross Ref
  46. L. Grigori, J. W. Demmel, and H. Xiang. CALU: A Communication Optimal LU Factorization Algorithm. SIAM Journal on Matrix Analysis and Applications, 32(4), 2011.Google ScholarGoogle Scholar
  47. A. Gupta. WSMP: Watson Sparse Matrix Package (Part-I: Direct Solution of Symmetric Sparse Systems). IBM TJ Watson Research Center, Yorktown Heights, NY, 2000.Google ScholarGoogle Scholar
  48. A. Gupta. WSMP: Watson Sparse Matrix Package (Part-II: Direct Solution of General Sparse Systems). IBM TJ Watson Research Center, Yorktown Heights, NY, 2000.Google ScholarGoogle Scholar
  49. K. He, S. X.-D. Tan, H. Wang, and G. Shi. GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis. IEEE Transactions on Very Large Scale Integration Systems, 24(3), 2015.Google ScholarGoogle Scholar
  50. G. Karypis and V. Kumar. METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-reducing Orderings of Sparse Matrices. 1997.Google ScholarGoogle Scholar
  51. B. Kumar, P. Sadayappan, and C.-H. Huang. On Sparse Matrix Reordering for Parallel Factorization. In ICS '94, 1994.Google ScholarGoogle Scholar
  52. W.-K. Lee, R. Achar, and M. S. Nakhla. Dynamic GPU Parallel Sparse LU Factorization for Fast Circuit Simulation. IEEE Transactions on Very Large Scale Integration Systems, 26(11), 2018.Google ScholarGoogle Scholar
  53. X. S. Li. An Overview of SuperLU: Algorithms, Implementation, and User Interface. ACM Transactions on Mathematical Software, 31(3), 2005.Google ScholarGoogle Scholar
  54. X. S. Li and J. W. Demmel. Making Sparse Gaussian Elimination Scalable by Static Pivoting. In SC '98, 1998.Google ScholarGoogle Scholar
  55. X. S. Li and J. W. Demmel. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Transactions on Mathematical Software, 29(2), 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. X. S. Li, P. Lin, Y. Liu, and P. Sao. Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver. ACM Transactions on Mathematical Software, 49(1), 2023.Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. J. W. H. Liu. The Role of Elimination Trees in Sparse Factorization. SIAM Journal on Matrix Analysis and Applications, 11(1), 1990.Google ScholarGoogle Scholar
  58. W. Liu, A. Li, J. D. Hogg, I. S. Duff, and B. Vinter. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In Euro-Par '16, 2016.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Y. Liu, N. Ding, P. Sao, S. Williams, and X. S. Li. Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters. In SC '23, 2023.Google ScholarGoogle Scholar
  60. Y. Liu, M. Jacquelin, P. Ghysels, and X. S. Li. Highly Scalable Distributed-memory Sparse Triangular Solution Algorithms. In CSC '18, 2018.Google ScholarGoogle Scholar
  61. S. Peng and S. X.-D. Tan. GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation. IEEE Design & Test, 37(3), 2020.Google ScholarGoogle ScholarCross RefCross Ref
  62. L. Ren, X. Chen, Y. Wang, C. Zhang, and H. Yang. Sparse LU Factorization for Parallel Circuit Simulation on GPU. In DAC '12, 2012.Google ScholarGoogle Scholar
  63. P. Sadayappan and S. K. Rao. Communication Reduction for Distributed Sparse Matrix Factorization on a Processor Mesh. In SC '89, 1989.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. P. Sadayappan and V. Visvanathan. Efficient Sparse Matrix Factorization for Circuit Simulation On Vector Supercomputers. In DAC '98, 1989.Google ScholarGoogle Scholar
  65. P. Sao, R. Kannan, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D Sparse Triangular Solver. In ICS '19, 2019.Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. P. Sao, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices. In IPDPS '18, 2018.Google ScholarGoogle ScholarCross RefCross Ref
  67. P. Sao, X. S. Li, and R. Vuduc. A Communication-Avoiding 3D Algorithm for Sparse LU Factorization on Heterogeneous Systems. Journal of Parallel and Distributed Computing, 131, 2019.Google ScholarGoogle Scholar
  68. P. Sao, X. Liu, R. Vuduc, and X. S. Li. A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems. In IPDPS '15, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. P. Sao, R. Vuduc, and X. S. Li. A Distributed CPU-GPU Sparse Direct Solver. In Euro-Par '14, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  70. O. Schenk and K. Gärtner. Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO. Future Generation Computer Systems, 20(3), 2004.Google ScholarGoogle Scholar
  71. O. Schenk and K. Gärtner. Two-level Dynamic Scheduling in PARDISO: Improved scalability on Shared memory Multiprocessing Systems. Parallel Computing, 28(2), 2002.Google ScholarGoogle Scholar
  72. O. Schenk, K. Gärtner, and W. Fichtner. Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors. BIT Numerical Mathematics, 40, 2000.Google ScholarGoogle Scholar
  73. O. Schenk, K. Gärtner, W. Fichtner, and A. Stricker. PARDISO: a High-Performance Serial and Parallel Sparse Linear Solver in Semiconductor Device Simulation. Future Generation Computer Systems, 18(1), 2001.Google ScholarGoogle Scholar
  74. G. Tan, C. Shui, Y. Wang, X. Yu, and Y. Yan. Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems. IEEE Transactions on Parallel and Distributed Systems, 32(9), 2021.Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. M. Tian, J. Wang, Z. Zhang, W. Du, J. Pan, and T. Liu. swSuperLU: A Highly Scalable Sparse Direct Solver on Sunway Manycore Architecture. The Journal of Supercomputing, 78(9), 2022.Google ScholarGoogle Scholar
  76. T. Wang, W. Li, H. Pei, Y. Sun, Z. Jin, and W. Liu. Accelerating Sparse LU Factorization with Density-Aware Adaptive Matrix Multiplication for Circuit Simulation. In DAC '23, 2023.Google ScholarGoogle Scholar
  77. J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li. Superfast Multifrontal Method for Large Structured Linear Systems of Equations. SIAM Journal on Matrix Analysis and Applications, 31(3), 2010.Google ScholarGoogle ScholarCross RefCross Ref
  78. Y. Xia, P. Jiang, G. Agrawal, and R. Ramnath. End-to-End LU Factorization of Large Matrices on GPUs. In PPoPP '23, 2023.Google ScholarGoogle Scholar
  79. Z. Yan, B. Xie, X. Li, and Y. Bao. Exploiting Architecture Advances for Sparse Solvers in Circuit Simulation. In DATE '22, 2022.Google ScholarGoogle Scholar
  80. C. Yu, W. Wang, and D. Pierce. A CPU-GPU Hybrid Approach for the Unsymmetric Multifrontal Method. Parallel Computing, 37(12), 2011.Google ScholarGoogle Scholar
  81. J. Zhao, Y. Wen, Y. Luo, Z. Jin, W. Liu, and Z. Zhou. SFLU: Synchronization-Free Sparse LU Factorization for Fast Circuit Simulation on GPUs. In DAC '21, 2021.Google ScholarGoogle Scholar

Index Terms

  1. PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
        November 2023
        1428 pages
        ISBN:9798400701092
        DOI:10.1145/3581784

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 11 November 2023

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate1,516of6,373submissions,24%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader