High-Performance SVD Partial Spectrum Computation

Authors:

David Keyes
King Abdullah University of Science & Technology, Thuwal, Saudi Arabia
https://orcid.org/0000-0002-4052-7224

Hatem Ltaief
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
https://orcid.org/0000-0002-6897-1095

Yuji Nakatsukasa
Mathematical Institute, University of Oxford, Oxford, United Kingdom
https://orcid.org/0000-0001-7911-1501

Dalal Sukkari
Innovative Computing Laboratory, University of Tennessee, Knoxville, United States of America
https://orcid.org/0000-0002-4228-4211

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No.: 74, Pages 1–12, https://doi.org/10.1145/3581784.3607109

Published: 11 November 2023

ABSTRACT

We introduce a new singular value decomposition (SVD) solver, QDWHpartial-SVD, based on the QR-based Dynamically Weighted Halley (QDWH) algorithm, for computing a partial spectrum of the SVD. By optimizing the rational function underlying the algorithm only in the desired part of the spectrum, QDWHpartial-SVD efficiently computes a fraction (say 1–20%) of the leading singular values and vectors. We develop a high-performance implementation of QDWHpartial-SVD on distributed-memory manycore systems and demonstrate its numerical robustness. We benchmark it against counterparts from state-of-the-art numerical libraries across various matrix sizes using up to 36K MPI processes. On a Cray XC40 system using 1152 nodes, each with two-socket 16-core Intel Haswell CPUs, QDWHpartial-SVD achieves speedups of up to 6X against vendor-optimized PDGESVD from ScaLAPACK and up to 2X against KSVD. We also port our QDWHpartial-SVD software library to a system composed of 256 nodes with two-socket 64-core AMD EPYC Milan CPUs and achieve speedups of up to 4X compared to vendor-optimized PDGESVD from ScaLAPACK. Finally, we compare the energy consumption of the two algorithms and show that QDWHpartial-SVD can further outperform PDGESVD in that regard by performing fewer memory-bound operations.
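
For readers unfamiliar with QDWH, the following single-node NumPy sketch may help fix ideas. It is not the authors' QDWHpartial-SVD algorithm or their distributed implementation: it runs the standard QR-based QDWH iteration to obtain a polar decomposition A = U_p H, eigendecomposes the symmetric factor H in full, and only then keeps the k leading singular triplets. The function names `qdwh_polar` and `leading_svd` are hypothetical, the sketch assumes a real matrix with at least as many rows as columns and full column rank, and the spectral bounds are taken from an exact SVD purely for brevity, where a practical code would estimate them cheaply.

```python
import numpy as np


def qdwh_polar(A, tol=1e-12, max_iter=20):
    """Polar decomposition A = Up @ H via the QR-based QDWH iteration.

    Illustrative sketch only: assumes A is real, m >= n, full column rank.
    """
    m, n = A.shape
    # Assumption: exact spectral bounds for brevity (not how a real solver does it).
    sigma = np.linalg.svd(A, compute_uv=False)
    alpha = sigma[0]              # upper bound on the largest singular value
    X = A / alpha                 # scaled so that the singular values lie in (0, 1]
    l = sigma[-1] / alpha         # lower bound on the smallest singular value of X
    eye = np.eye(n)
    for _ in range(max_iter):
        l = min(l, 1.0)           # guard against rounding pushing l above 1
        # Dynamically weighted Halley parameters (a, b, c) for the current bound l.
        d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # Inverse-free update through a QR factorization of [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, eye]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        converged = np.linalg.norm(X_new - X) <= tol * np.linalg.norm(X_new)
        X = X_new
        if converged:
            break
    H = X.T @ A
    return X, 0.5 * (H + H.T)     # symmetrize H against rounding errors


def leading_svd(A, k):
    """k leading singular triplets of A from its polar factors (full, then truncated)."""
    Up, H = qdwh_polar(A)
    w, V = np.linalg.eigh(H)      # eigenvalues of H are the singular values of A
    idx = np.argsort(w)[::-1][:k] # indices of the k largest eigenvalues
    return Up @ V[:, idx], w[idx], V[:, idx].T


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = leading_svd(A, 2)
```

Because this sketch computes the whole spectrum before truncating, it shows only the QDWH building block; the savings described in the abstract come from restricting the work to the desired part of the spectrum, which is not reproduced here.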

Index Terms

High-Performance SVD Partial Spectrum Computation
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
2. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance
    2. Solvers

Published in
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2023
1428 pages
ISBN: 9798400701092
DOI: 10.1145/3581784
Chair: Dorian Arnold
Program Chair: Rosa M Badia
Program Co-chair: Kathryn Mohror
Copyright © 2023 Owner/Author(s)
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
singular value decomposition
partial spectrum
parallel numerical algorithms
distributed-memory systems
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%