research-article

A High Performance QDWH-SVD Solver Using Hardware Accelerators

Published: 13 August 2016

Abstract

This article describes a new high performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architectures enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) a matrix-matrix multiplication to obtain the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops than the standard SVD algorithm when computing all singular vectors, our new high performance implementation on a single GPU achieves up to 4× improvement for asymptotic matrix sizes compared to the equivalent routines from existing state-of-the-art open-source and commercial libraries. When only singular values are needed, QDWH-SVD is penalized by performing an order of magnitude more flops; even so, the singular-value-only implementation on a single GPU can still run up to 18% faster than the best existing equivalent routines.
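The three stages can be illustrated with a small CPU-only NumPy sketch. This is not the article's GPU implementation: it assumes a square, nonsingular, real matrix, and the names `qdwh_polar` and `qdwh_svd`, the Frobenius-norm scaling, the inverse-based lower bound on the smallest singular value, and the stopping tolerance are all illustrative choices. Writing the polar decomposition as A = U_p H (H symmetric positive semidefinite) and the eigendecomposition as H = V Σ Vᵀ, the SVD follows as A = (U_p V) Σ Vᵀ.

```python
import numpy as np


def qdwh_polar(A, tol=1e-10, max_iter=50):
    """Stage 1 sketch: polar decomposition A = U_p @ H via the QR-based
    dynamically weighted Halley iteration (square nonsingular A assumed)."""
    m, n = A.shape
    alpha = np.linalg.norm(A, "fro")       # upper bound on the 2-norm of A
    X = A / alpha                          # singular values of X lie in (0, 1]
    # Lower bound on sigma_min(X): 1 / ||X^{-1}||_F <= 1 / ||X^{-1}||_2.
    l = 1.0 / np.linalg.norm(np.linalg.inv(X), "fro")
    eye = np.eye(n)
    for _ in range(max_iter):
        l = min(l, 1.0)                    # guard against rounding above 1
        # Dynamically chosen weights a, b, c for the current bound l.
        d = np.cbrt(4.0 * (1.0 - l**2) / l**4)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # Inverse-free update: QR of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, eye]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        converged = np.linalg.norm(X_new - X, "fro") <= tol
        X = X_new
        if converged:
            break
    U_p = X
    H = U_p.T @ A
    H = 0.5 * (H + H.T)                    # enforce symmetry against rounding
    return U_p, H


def qdwh_svd(A):
    """Stages 1-3: polar factor, symmetric eigendecomposition, then a
    matrix-matrix product for the left singular vectors."""
    U_p, H = qdwh_polar(A)                 # stage 1
    w, V = np.linalg.eigh(H)               # stage 2: H = V diag(w) V^T
    order = np.argsort(w)[::-1]            # singular values in descending order
    s, V = w[order], V[:, order]
    U = U_p @ V                            # stage 3
    return U, s, V.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6))
    U, s, Vt = qdwh_svd(A)
    print(np.linalg.norm(U @ np.diag(s) @ Vt - A))
```

The sketch relies only on QR factorization, matrix products, and a symmetric eigensolver in its main loop and post-processing, which is the property that makes the approach attractive on accelerators; the explicit inverse used for the initial bound is a shortcut for this small example only.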

References

[1]
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In J. Phys.: Conf. Ser., 180 (2009).
[2]
Edward Anderson, Zhaojun Bai, Christian Heinrich Bischof, Laura Susan Blackford, James Weldon Demmel, Jack J. Dongarra, Jeremy J. Du Croz, Anne Greenbaum, Sven Hammarling, A. McKenney, and Danny C. Sorensen. 1999. LAPACK Users' Guide (3rd ed.). Society for Industrial and Applied Mathematics, Philadelphia.
[3]
K. S. Arun. 1992. A unitarily constrained total least squares problem in signal processing. SIAM J. Matrix Anal. Appl. 13, 3 (1992), 729--745.
[4]
Grey Ballard, James Demmel, and Ioana Dumitriu. 2010. Minimizing communication for eigenproblems and the singular value decomposition. CoRR abs/1011.3077 (2010). http://arxiv.org/abs/1011.3077.
[5]
I. Y. Bar-Itzhack. 1975. Iterative optimal orthogonalization of the strapdown matrix. IEEE Trans. Aerospace Electron. Syst. AES-11, 1 (Jan. 1975), 30--37.
[6]
Christian H. Bischof, Bruno Lang, and Xiaobai Sun. 2000. Algorithm 807: The SBR toolbox—Software for successive band reduction. ACM Trans. Math. Software 26, 4 (2000), 602--616.
[7]
BLAS. 2013. Basic Linear Algebra Subprograms v3.5. (Nov 2013). Available at http://www.netlib.org/blas/.
[8]
James Demmel and W. Kahan. 1990. Computing small singular values of bidiagonal matrices with guaranteed high relative accuracy. SIAM J. Sci. Statist. Comput. 5 (1990), 873--912.
[9]
Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25, 1 (Feb. 2011), 3--60.
[10]
K. Vince Fernando and Beresford N. Parlett. 1994. Accurate singular values and differential QD algorithms. Num. Math. 67 (1994), 191--229.
[11]
Jerome A. Goldstein and Mel Levy. 1991. Linear algebra and quantum chemistry. Am. Math. Monthly 98, 10 (Oct. 1991), 710--718.
[12]
Gene H. Golub and C. Reinsch. 1970. Singular value decomposition and least squares solutions. Num. Math. 14 (1970), 403--420.
[13]
Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, Maryland.
[14]
Ming Gu and Stanley C. Eisenstat. 1995. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16, 1 (1995), 79--92.
[15]
Azzam Haidar, Jakub Kurzak, and Piotr Luszczek. 2013. An improved parallel singular value algorithm and its implementation for multicore hardware. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article 90 (2013), 12 pages.
[16]
Azzam Haidar, Hatem Ltaief, and Jack Dongarra. 2011. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC'11 Conference on High Performance Computing Networking, Storage and Analysis. ACM, 8.
[17]
Azzam Haidar, Stanimire Tomov, Jack Dongarra, Raffaele Solcà, and Thomas C. Schulthess. 2014. A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks. IJHPCA (2014), 196--209.
[18]
Per Christian Hansen. 1998. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. Society for Industrial and Applied Mathematics, Philadelphia. http://books.google.com.sa/books?id=A5XWG_PFFdcC.
[19]
Nicholas J. Higham and Pythagoras Papadimitriou. 1993. Parallel Singular Value Decomposition via the Polar Decomposition. Numerical Analysis Report No. 239. University of Manchester, England. ftp://vtx.ma.man.ac.uk/pub/narep/narep239.dvi.Z.
[20]
Intel. 2015. Math Kernel Library. (2015). Available at http://software.intel.com/en-us/articles/intel-mkl/.
[21]
Bruno Lang. 1999. Efficient eigenvalue and singular value computations on shared memory machines. Parallel Comput. 25, 7 (1999), 845--860.
[22]
Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. 2010. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems 21, 4 (April 2010).
[23]
Hatem Ltaief, Piotr Luszczek, Azzam Haidar, and Jack Dongarra. 2011. Solving the generalized symmetric eigenvalue problem using tile algorithms on multicore architectures. In PARCO (Advances in Parallel Computing), Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Joubert, David A. Padua, Frans J. Peters, and Mark Sawyer (Eds.), Vol. 22. IOS Press, 397--404. http://dx.doi.org/10.3233/978-1-61499-041-3-397
[24]
Piotr Luszczek, Hatem Ltaief, and Jack Dongarra. 2011. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In Proceedings of IPDPS 2011. ACM.
[25]
MAGMA. 2009. Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee. (2009). Available at http://icl.cs.utk.edu/magma/.
[26]
Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. 2010. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM J. Matrix Anal. Appl. (2010), 2700--2720.
[27]
Yuji Nakatsukasa and Nicholas J. Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput. 35, 3 (2013), A1325--A1349.
[28]
Robert Schreiber and Beresford Parlett. 1988. Block reflectors: Theory and computation. SIAM J. Numer. Anal. 25, 1 (1988), 189--205.
[29]
Peter H. Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika 31, 1 (1966), 1--10.
[30]
Lloyd N. Trefethen and David Bau. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA. http://www.siam.org/books/OT50/Index.htm.
[31]
Asim YarKhan, Jakub Kurzak, and Jack Dongarra. 2011. QUARK Users' Guide: QUeueing and Runtime for Kernels. University of Tennessee Innovative Computing Laboratory Technical Report ICL-UT-11-02.

    Published In

    ACM Transactions on Mathematical Software  Volume 43, Issue 1
    March 2017
    202 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/2987591
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 August 2016
    Accepted: 01 February 2016
    Revised: 01 February 2016
    Received: 01 June 2015
    Published in TOMS Volume 43, Issue 1

    Author Tags

    1. GPU-based scientific computing
    2. Singular value decomposition
    3. mixed precision algorithms
    4. polar decomposition
    5. symmetric eigensolver

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2024) svds-C: A multi-thread C code for computing truncated singular value decomposition. SoftwareX 27 (101781). DOI: 10.1016/j.softx.2024.101781. Online publication date: Sep-2024
    • (2023) Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (1680-1687). DOI: 10.1145/3624062.3624248. Online publication date: 12-Nov-2023
    • (2023) High-Performance SVD Partial Spectrum Computation. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.1145/3581784.3607109. Online publication date: 12-Nov-2023
    • (2023) Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (301-312). DOI: 10.1145/3572848.3577516. Online publication date: 25-Feb-2023
    • (2020) Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators. Proceedings of the 49th International Conference on Parallel Processing (1-11). DOI: 10.1145/3404397.3404416. Online publication date: 17-Aug-2020
    • (2020) A literature survey of matrix methods for data science. GAMM-Mitteilungen 43:3. DOI: 10.1002/gamm.202000013. Online publication date: 10-Sep-2020
    • (2019) Massively Parallel Polar Decomposition on Distributed-memory Systems. ACM Transactions on Parallel Computing 6:1 (1-15). DOI: 10.1145/3328723. Online publication date: 7-Jun-2019
    • (2019) A QDWH-based SVD Software Framework on Distributed-memory Manycore Systems. ACM Transactions on Mathematical Software 45:2 (1-21). DOI: 10.1145/3309548. Online publication date: 26-Apr-2019
    • (2019) Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems. 2019 IEEE International Conference on Cluster Computing (CLUSTER) (1-12). DOI: 10.1109/CLUSTER.2019.8891024. Online publication date: Sep-2019
    • (2019) A High Performance Implementation of Zolo-SVD algorithm on Distributed Memory Systems. Parallel Computing. DOI: 10.1016/j.parco.2019.04.004. Online publication date: Apr-2019