ABSTRACT
The sparse triangular solve (SpTRSV) kernel is an important building block for many linear algebra routines, such as sparse direct and iterative solvers. The major challenge in accelerating SpTRSV is the difficulty of exposing more parallelism. Existing work mainly focuses on reducing dependencies and synchronizations in level-set methods. However, the 2D block layout of the input matrix has been largely ignored in designing more efficient SpTRSV algorithms.
In this paper, we implement three block algorithms (column block, row block, and recursive block) for parallel SpTRSV on modern GPUs, and propose an adaptive approach that automatically selects the best kernels according to the input sparsity structure. In tests on 159 sparse matrices on two high-end NVIDIA GPUs, the recursive block algorithm performs best among the three block algorithms, and is on average 4.72x (up to 72.03x) and 9.95x (up to 61.08x) faster than the cuSPARSE v2 and Sync-free methods, respectively. In addition, our method needs only a moderate preprocessing cost for the input matrix, and is therefore highly efficient for multiple right-hand sides and iterative scenarios.
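To make the recursive block idea concrete, the following is a minimal serial sketch in Python, not the GPU implementation evaluated in the paper. It assumes a lower triangular matrix stored row by row as lists of `(column, value)` pairs with every diagonal entry present. The names `forward_substitution`, `recursive_block_sptrsv` and the `leaf` threshold are hypothetical and chosen only for illustration. The matrix is split into two triangular diagonal blocks and one rectangular off-diagonal block: the first triangular block is solved, the rectangular block updates the remaining right-hand side (an SpMV-like step), and the second triangular block is then solved.

```python
def forward_substitution(rows, b, lo, hi):
    """Serially solve the diagonal block covering rows/columns [lo, hi).

    b is overwritten in place with the solution entries. Contributions from
    columns left of lo are assumed to have been subtracted already.
    """
    for i in range(lo, hi):
        acc = b[i]
        diag = None
        for j, v in rows[i]:
            if j == i:
                diag = v
            elif lo <= j < i:
                acc -= v * b[j]
        b[i] = acc / diag  # assumes a nonzero diagonal entry is stored


def recursive_block_sptrsv(rows, b, lo=0, hi=None, leaf=2):
    """Solve L x = b by recursive 2x2 blocking (illustrative sketch only)."""
    if hi is None:
        hi = len(rows)
    if hi - lo <= leaf:
        forward_substitution(rows, b, lo, hi)
        return b
    mid = (lo + hi) // 2
    # 1) Solve the top-left triangular block.
    recursive_block_sptrsv(rows, b, lo, mid, leaf)
    # 2) Rectangular update: b[mid:hi] -= L[mid:hi, lo:mid] * x[lo:mid].
    for i in range(mid, hi):
        for j, v in rows[i]:
            if lo <= j < mid:
                b[i] -= v * b[j]
    # 3) Solve the bottom-right triangular block.
    recursive_block_sptrsv(rows, b, mid, hi, leaf)
    return b


# Small 4x4 example; rhs was built from the solution [1, 2, 3, 4].
L_rows = [
    [(0, 2.0)],
    [(0, 1.0), (1, 1.0)],
    [(1, 3.0), (2, 4.0)],
    [(0, 5.0), (2, 6.0), (3, 2.0)],
]
rhs = [2.0, 3.0, 18.0, 31.0]
x = recursive_block_sptrsv(L_rows, list(rhs), leaf=1)
print(x)  # [1.0, 2.0, 3.0, 4.0]
```

In this decomposition only the triangular leaf blocks carry sequential dependencies. The rectangular update in step 2 has independent rows, which is the kind of work a GPU kernel could process in parallel.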