Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver

Park, Jongsoo; Smelyanskiy, Mikhail; Sundaram, Narayanan; Dubey, Pradeep

doi:10.1007/978-3-319-07518-1_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8488)

Included in the following conference series:

International Supercomputing Conference

Abstract

The last decade has seen rapid growth of single-chip multiprocessors (CMPs), which have leveraged Moore’s law to deliver high concurrency through increases in core count and vector width. Modern CMPs execute from several hundred to several thousand concurrent operations per second, while their memory subsystems deliver from tens to hundreds of gigabytes per second of bandwidth.

Taking advantage of these parallel resources requires highly tuned parallel implementations of key computational kernels, which form the backbone of modern HPC. The sparse triangular solver is one such kernel and is the focus of this paper. It is widely used in several types of sparse linear solvers, and it is commonly considered challenging to parallelize and scale even on a moderate number of cores. The challenge arises because, compared to data-parallel operations such as sparse matrix-vector multiplication, a triangular solver typically has limited task-level parallelism and relies on fine-grained synchronization to exploit it.
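To make the limited task-level parallelism concrete, the following minimal sketch solves a lower-triangular system stored in CSR form by grouping rows into dependency levels. It is an illustration only, not the authors' implementation: the function names `level_sets` and `solve_lower`, the CSR argument names, and the small example matrix are all assumptions introduced here. Rows within one level do not depend on each other, so a parallel version could process them concurrently and synchronize between levels.

```python
def level_sets(row_ptr, col_idx):
    """Group rows of a lower-triangular CSR matrix into dependency levels."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # strictly lower entry: row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        levels[lvl].append(i)
    return levels


def solve_lower(row_ptr, col_idx, vals, b, levels):
    """Solve L x = b one level at a time (sequential reference version)."""
    x = [0.0] * len(b)
    for rows in levels:      # a parallel version would synchronize between levels
        for i in rows:       # rows in the same level are mutually independent
            s, diag = b[i], None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    s -= vals[k] * x[j]
            x[i] = s / diag
    return x


# Hypothetical 4x4 example:
# [[2, 0, 0, 0],
#  [0, 1, 0, 0],
#  [1, 3, 4, 0],
#  [0, 0, 2, 5]]
row_ptr = [0, 1, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 2, 3]
vals = [2.0, 1.0, 1.0, 3.0, 4.0, 2.0, 5.0]
b = [2.0, 2.0, 19.0, 26.0]

levels = level_sets(row_ptr, col_idx)
x = solve_lower(row_ptr, col_idx, vals, b, levels)
```

In this example only rows 0 and 1 share a level; rows 2 and 3 each form a level of their own, which shows how quickly the available parallelism can shrink.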

This paper presents a synchronization sparsification technique that significantly reduces the synchronization overhead of the sparse triangular solver and improves its scalability. We find that a majority of the dependencies in the task dependency graphs used to model the flow of computation in the sparse triangular solver are redundant. We propose a fast, approximate sparsification algorithm that eliminates more than 90% of these dependencies, substantially reducing synchronization overhead. As a result, on a 12-core Intel® Xeon® processor, our approach improves the performance of the sparse triangular solver by 1.6x compared to conventional level scheduling with barrier synchronization. This, in turn, leads to a 1.4x speedup in a preconditioned conjugate gradient solver.
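The idea of a redundant dependency can be illustrated with a small sketch. The abstract does not spell out the sparsification algorithm, so the rule used below is an assumption chosen for illustration: an edge u -> w is dropped whenever a two-edge path u -> v -> w also exists, because waiting for v already implies that w's other prerequisite u has finished. The function name `sparsify_two_hop` and the example graph are hypothetical.

```python
def sparsify_two_hop(succ):
    """Drop edge u -> w whenever some path u -> v -> w exists in the DAG.

    `succ` maps each task to the tasks that depend on it. This is an
    approximate transitive reduction (an assumed rule, not the paper's
    stated algorithm): it only detects redundancy through two-edge paths.
    """
    kept = {}
    for u, outs in succ.items():
        out_set = set(outs)
        redundant = set()
        for v in out_set:
            # Successors of v that u also points to are implied through v.
            redundant.update(w for w in succ.get(v, ()) if w in out_set)
        kept[u] = sorted(out_set - redundant)
    return kept


# Hypothetical task dependency graph with 6 edges.
succ = {
    0: [1, 2, 3],
    1: [2, 3],
    2: [3],
    3: [],
}
sparse = sparsify_two_hop(succ)
edges_before = sum(len(v) for v in succ.values())
edges_after = sum(len(v) for v in sparse.values())
```

Here half of the edges disappear while every task still transitively waits for all of its original prerequisites. In an acyclic graph, an edge with no alternative path is never removed by this rule, so the ordering constraints are preserved.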

Author information

Authors and Affiliations

Parallel Computing Lab, Intel Corporation, Santa Clara, CA, USA
Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram & Pradeep Dubey

Editor information

Editors and Affiliations

MIN Faculty, Department of Informatics Scientific Computing, University of Hamburg, Bundesstraße 45a, 20146, Hamburg, Germany
Julian Martin Kunkel
Deutsches Klimarechenzentrum, Bundesstraße 45a, 20146, Hamburg, Germany
Thomas Ludwig
Germany and Prometeus GmbH, University of Mannheim, Fliederstraße 2, 74915, Waibstadt, Germany
Hans Werner Meuer

About this paper

Cite this paper

Park, J., Smelyanskiy, M., Sundaram, N., Dubey, P. (2014). Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2014. Lecture Notes in Computer Science, vol 8488. Springer, Cham. https://doi.org/10.1007/978-3-319-07518-1_8

DOI: https://doi.org/10.1007/978-3-319-07518-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07517-4
Online ISBN: 978-3-319-07518-1
eBook Packages: Computer Science, Computer Science (R0)
