ABSTRACT
Sparse triangular solve (SpTRSV) is an important scientific kernel used in several applications, such as preconditioners for Krylov methods. Parallelizing SpTRSV on multi-core systems is challenging: computational dependencies limit the available parallelism, and the fine-grained, unbalanced nature of the workloads introduces high parallelization overhead. We propose a novel method, named Adaptive Level Binning (ALB), that addresses these challenges by eliminating redundant synchronization points and adapting the work granularity with an efficient load balancing strategy. Like the commonly used level-set methods for solving SpTRSV, ALB constructs level-sets of rows, where the rows in each level can be computed in parallel. Unlike those methods, ALB bins rows to levels adaptively and reduces redundant dependencies between rows. On an Intel® Xeon® Gold 6148 processor and an NVIDIA® Tesla V100 GPU, ALB achieves an average speedup of 1.83x and a maximum speedup of 5.28x over Intel MKL, and an average speedup of 2.80x and a maximum speedup of 39.40x over NVIDIA cuSPARSE, for 29 matrices selected from the SuiteSparse Matrix Collection.
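To make the level-set idea concrete, the following is a minimal sketch of the classic level-set approach that the abstract uses as its point of comparison. It is not ALB itself and does not include adaptive binning or dependency reduction. It assumes a lower-triangular matrix stored in CSR form with an explicit nonzero diagonal; the function names `build_levels` and `sptrsv_levels` are hypothetical and chosen only for illustration.

```python
def build_levels(indptr, indices):
    """Group the rows of a lower-triangular CSR matrix into level-sets.

    A row's level is one more than the highest level among the rows it
    depends on, so all rows in the same level are mutually independent.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:  # off-diagonal entry: row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def sptrsv_levels(indptr, indices, data, b):
    """Solve L x = b by sweeping the level-sets in order (serial sketch)."""
    x = [0.0] * len(b)
    for rows in build_levels(indptr, indices):
        # Rows in one level do not depend on each other. A parallel version
        # would split this inner loop across threads and place a
        # synchronization point after each level.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    acc -= data[k] * x[j]
            x[i] = acc / diag  # assumes the diagonal entry is stored
    return x
```

In this sketch every level ends with an implicit barrier, which is where the synchronization overhead described above comes from when levels contain only a few rows.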
Index Terms
- Adaptive Level Binning: A New Algorithm for Solving Sparse Triangular Systems