A High Performance QDWH-SVD Solver Using Hardware Accelerators

Authors:

David Keyes

ACM Transactions on Mathematical Software (TOMS), Volume 43, Issue 1

Article No.: 6, Pages 1 - 25

https://doi.org/10.1145/2894747

Published: 13 August 2016

Abstract

This article describes a new high performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architectures enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) a matrix-matrix multiplication to obtain the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops than the standard SVD algorithm when computing all singular vectors, our new high performance implementation on a single GPU achieves up to 4× improvement for asymptotic matrix sizes compared to the equivalent routines from existing state-of-the-art open-source and commercial libraries. When only singular values are needed, however, QDWH-SVD is penalized by performing an order of magnitude more flops. Even so, the singular-value-only implementation of QDWH-SVD on a single GPU can still run up to 18% faster than the best existing equivalent routines.
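The three stages named in the abstract can be sketched on the CPU with NumPy. This is a minimal illustration only, not the article's GPU implementation: the function names `qdwh_polar` and `qdwh_svd`, the parameters `tol` and `max_iter`, and the stopping test are hypothetical choices made for this sketch. The weights `a`, `b`, `c` follow the dynamically weighted Halley iteration as commonly stated from reference [26]. The sketch assumes a real matrix with at least as many rows as columns and full column rank, and it computes the norm and the smallest singular value exactly, where a practical solver would use cheap estimates.

```python
import numpy as np


def qdwh_polar(A, tol=1e-10, max_iter=20):
    """Sketch of stage 1: polar decomposition A = Up @ H via QDWH.

    Assumes A is real, m >= n, with full column rank (illustrative only).
    """
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)                 # scale so that ||X||_2 <= 1
    X = A / alpha
    # Lower bound on the smallest singular value of X. Computed exactly here
    # for simplicity; a real solver would use an inexpensive estimate.
    l = min(np.linalg.svd(X, compute_uv=False)[-1], 1.0)
    for _ in range(max_iter):
        l2 = l * l
        # Dynamic weights of the Halley iteration.
        d = np.cbrt(4.0 * max(1.0 - l2, 0.0) / l2**2)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR-based update: factor [sqrt(c) X; I] = [Q1; Q2] R.
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        l = min(l * (a + b * l2) / (1.0 + c * l2), 1.0)
        delta = np.linalg.norm(X_new - X, "fro")
        X = X_new
        if delta < tol:
            break
    Up = X                                       # orthonormal polar factor
    H = Up.T @ A
    H = 0.5 * (H + H.T)                          # symmetrize against rounding
    return Up, H


def qdwh_svd(A):
    """Sketch of the three-stage SVD: A = U @ diag(s) @ Vt."""
    Up, H = qdwh_polar(A)                        # stage 1: polar decomposition
    w, V = np.linalg.eigh(H)                     # stage 2: symmetric eigensolver
    s, V = w[::-1], V[:, ::-1]                   # descending singular values
    U = Up @ V                                   # stage 3: left singular vectors
    return U, s, V.T
```

Calling `qdwh_svd(A)` on a small, well-conditioned matrix should give singular values that agree with `np.linalg.svd(A, compute_uv=False)` up to rounding. The stacked QR factorization in the loop is the step that makes the iteration inverse-free.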

References

[1]

Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In J. Phys.: Conf. Ser., 180 (2009).

[2]

Edward Anderson, Zhaojun Bai, Christian Heinrich Bischof, Laura Susan Blackford, James Weldon Demmel, Jack J. Dongarra, Jeremy J. Du Croz, Anne Greenbaum, Sven Hammarling, A. McKenney, and Danny C. Sorensen. 1999. LAPACK User's Guide (3rd ed.). Society for Industrial and Applied Mathematics, Philadelphia.

[3]

K. S. Arun. 1992. A unitarily constrained total least squares problem in signal processing. SIAM J. Matrix Anal. Appl. 13, 3 (1992), 729--745.

[4]

Grey Ballard, James Demmel, and Ioana Dumitriu. 2010. Minimizing communication for eigenproblems and the singular value decomposition. CoRR abs/1011.3077 (2010). http://arxiv.org/abs/1011.3077.

[5]

I. Y. Bar-Itzhack. 1975. Iterative optimal orthogonalization of the strapdown matrix. IEEE Trans. Aerospace Electron. Syst. AES-11, 1 (Jan. 1975), 30--37.

[6]

Christian H. Bischof, Bruno Lang, and Xiaobai Sun. 2000. Algorithm 807: The SBR toolbox—Software for successive band reduction. ACM Trans. Math. Software 26, 4 (2000), 602--616.

[7]

BLAS. 2013. Basic Linear Algebra Subprograms v3.5. (Nov 2013). Available at http://www.netlib.org/blas/.

[8]

James Demmel and W. Kahan. 1990. Computing small singular values of bidiagonal matrices with guaranteed high relative accuracy. SIAM J. Sci. Statist. Comput. 5 (1990), 873--912.

[9]

Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25, 1 (Feb. 2011), 3--60.

[10]

K. Vince Fernando and Beresford N. Parlett. 1994. Accurate singular values and differential QD algorithms. Num. Math. 67 (1994), 191--229.

[11]

Jerome A. Goldstein and Mel Levy. 1991. Linear algebra and quantum chemistry. Am. Math. Monthly 98, 10 (Oct. 1991), 710--718.

[12]

Gene H. Golub and C. Reinsch. 1970. Singular value decomposition and least squares solutions. Num. Math. 14 (1970), 403--420.

[13]

Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, Maryland.

[14]

Ming Gu and Stanley C. Eisenstat. 1995. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16, 1 (1995), 79--92.

[15]

Azzam Haidar, Jakub Kurzak, and Piotr Luszczek. 2013. An improved parallel singular value algorithm and its implementation for multicore hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Article 90, 12 pages.

[16]

Azzam Haidar, Hatem Ltaief, and Jack Dongarra. 2011. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC'11 Conference on High Performance Computing Networking, Storage and Analysis. ACM, 8.

[17]

Azzam Haidar, Stanimire Tomov, Jack Dongarra, Raffaele Solcà, and Thomas C. Schulthess. 2014. A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks. IJHPCA (2014), 196--209.

[18]

Per Christian Hansen. 1998. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. Society for Industrial and Applied Mathematics, Philadelphia. http://books.google.com.sa/books?id=A5XWG_PFFdcC.

[19]

Nicholas J. Higham and Pythagoras Papadimitriou. 1993. Parallel Singular Value Decomposition via the Polar Decomposition. Numerical Analysis Report No. 239. University of Manchester, England. ftp://vtx.ma.man.ac.uk/pub/narep/narep239.dvi.Z.

[20]

Intel. 2015. Math Kernel Library. (2015). Available at http://software.intel.com/en-us/articles/intel-mkl/.

[21]

Bruno Lang. 1999. Efficient eigenvalue and singular value computations on shared memory machines. Parallel Comput. 25, 7 (1999), 845--860.

[22]

Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. 2010. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems 21, 4 (April 2010).

[23]

Hatem Ltaief, Piotr Luszczek, Azzam Haidar, and Jack Dongarra. 2011. Solving the generalized symmetric eigenvalue problem using tile algorithms on multicore architectures. In PARCO (Advances in Parallel Computing), Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Joubert, David A. Padua, Frans J. Peters, and Mark Sawyer (Eds.), Vol. 22. IOS Press, 397--404. http://dx.doi.org/10.3233/978-1-61499-041-3-397

[24]

Piotr Luszczek, Hatem Ltaief, and Jack Dongarra. 2011. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In Proceedings of IPDPS 2011. ACM.

[25]

MAGMA. 2009. Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee. (2009). Available at http://icl.cs.utk.edu/magma/.

[26]

Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. 2010. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl. (2010), 2700--2720.

[27]

Yuji Nakatsukasa and Nicholas J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325--A1349.

[28]

Robert Schreiber and Beresford Parlett. 1988. Block reflectors: Theory and computation. SIAM J. Numer. Anal. 25, 1 (1988), 189--205.

[29]

Peter H. Schönemann. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1 (1966), 1--10.

[30]

Lloyd N. Trefethen and David Bau. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA. http://www.siam.org/books/OT50/Index.htm.

[31]

Asim YarKhan, Jakub Kurzak, and Jack Dongarra. 2011. QUARK Users' Guide: QUeueing and Runtime for Kernels. University of Tennessee Innovative Computing Laboratory Technical Report ICL-UT-11-02.

Index Terms

A High Performance QDWH-SVD Solver Using Hardware Accelerators
1. Mathematics of computing
  1. Mathematical software

Published In

ACM Transactions on Mathematical Software Volume 43, Issue 1

March 2017

202 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/2987591

Editor:
Michael A. Heroux
Sandia National Laboratories, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2016

Accepted: 01 February 2016

Revised: 01 February 2016

Received: 01 June 2015

Published in TOMS Volume 43, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 319
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 4

Reflects downloads up to 05 Mar 2025

Cited By

Feng X, Yu W, Xie Y. (2024). svds-C: A multi-thread C code for computing truncated singular value decomposition. SoftwareX 27 (101781). Online publication date: Sep-2024.
https://doi.org/10.1016/j.softx.2024.101781
Sukkari D, Gates M, Al Farhan M, Anzt H, Dongarra J. (2023). Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (1680-1687). Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3624062.3624248
Keyes D, Ltaief H, Nakatsukasa Y, Sukkari D, Mohror K, Arnold D, Badia R. (2023). High-Performance SVD Partial Spectrum Computation. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607109
Zhang S, Shah R, Ootomo H, Yokota R, Wu P, Dehnavi M, Kulkarni M, Krishnamoorthy S. (2023). Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (301-312). Online publication date: 25-Feb-2023.
https://dl.acm.org/doi/10.1145/3572848.3577516
Williams-Young D, Yang C. (2020). Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators. Proceedings of the 49th International Conference on Parallel Processing (1-11). Online publication date: 17-Aug-2020.
https://dl.acm.org/doi/10.1145/3404397.3404416
Stoll M. (2020). A literature survey of matrix methods for data science. GAMM-Mitteilungen 43:3. Online publication date: 10-Sep-2020.
https://doi.org/10.1002/gamm.202000013
Ltaief H, Sukkari D, Esposito A, Nakatsukasa Y, Keyes D. (2019). Massively Parallel Polar Decomposition on Distributed-memory Systems. ACM Transactions on Parallel Computing 6:1 (1-15). Online publication date: 7-Jun-2019.
https://dl.acm.org/doi/10.1145/3328723
Sukkari D, Ltaief H, Esposito A, Keyes D. (2019). A QDWH-based SVD Software Framework on Distributed-memory Manycore Systems. ACM Transactions on Mathematical Software 45:2 (1-21). Online publication date: 26-Apr-2019.
https://dl.acm.org/doi/10.1145/3309548
Sukkari D, Ltaief H, Keyes D, Faverge M. (2019). Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems. 2019 IEEE International Conference on Cluster Computing (CLUSTER) (1-12). Online publication date: Sep-2019.
https://doi.org/10.1109/CLUSTER.2019.8891024
Li S, Liu J, Du Y. (2019). A High Performance Implementation of Zolo-SVD algorithm on Distributed Memory Systems. Parallel Computing. Online publication date: Apr-2019.
https://doi.org/10.1016/j.parco.2019.04.004