ABSTRACT
We introduce a new singular value decomposition (SVD) solver, QDWHpartial-SVD, based on the QR-based Dynamically Weighted Halley (QDWH) algorithm, for computing a partial spectrum of the SVD. By optimizing the rational function underlying the algorithm only in the desired part of the spectrum, QDWHpartial-SVD efficiently computes a fraction (say 1-20%) of the leading singular values and vectors. We develop a high-performance implementation of QDWHpartial-SVD on distributed-memory manycore systems and demonstrate its numerical robustness. We benchmark it against counterparts from state-of-the-art numerical libraries across various matrix sizes using up to 36K MPI processes. On a Cray XC40 system with 1152 nodes, each based on two-socket 16-core Intel Haswell CPUs, QDWHpartial-SVD achieves speedups of up to 6X against the vendor-optimized PDGESVD from ScaLAPACK and up to 2X against KSVD. We also port our QDWHpartial-SVD software library to a system composed of 256 nodes with two-socket 64-core AMD EPYC Milan CPUs and achieve speedups of up to 4X compared to the vendor-optimized PDGESVD from ScaLAPACK. Finally, we compare the energy consumption of QDWHpartial-SVD and PDGESVD and show that QDWHpartial-SVD further outperforms PDGESVD in that regard by performing fewer memory-bound operations.
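To make the QDWH building block concrete, the following is a minimal single-node NumPy sketch of the standard QR-based dynamically weighted Halley iteration for the polar decomposition, followed by a symmetric eigensolve to recover leading singular triplets. It is an illustration under stated assumptions, not the QDWHpartial-SVD algorithm itself: it computes the full polar factor and only then truncates to the k leading triplets, whereas the abstract describes optimizing the rational function for the desired part of the spectrum. The function names `qdwh_polar` and `leading_svd_via_qdwh` are hypothetical, and the smallest singular value is computed exactly for simplicity where a practical code would use a cheap estimate.

```python
import numpy as np

def qdwh_polar(A, tol=1e-10, max_iter=30):
    """Orthonormal polar factor of A (m >= n, full column rank) via QR-based DWH."""
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)              # scale so that sigma_max(X) is about 1
    X = A / alpha
    # Assumption: exact sigma_min is used as the lower bound l (costly, for clarity only).
    l = float(min(1.0, np.linalg.svd(X, compute_uv=False)[-1]))
    I = np.eye(n)
    for _ in range(max_iter):
        l2 = l * l
        # Dynamically chosen weights (a, b, c) of the Halley-type rational function.
        d = (4.0 * max(0.0, 1.0 - l2) / (l2 * l2)) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + d)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # Inverse-free update: QR of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + ((a - b / c) / np.sqrt(c)) * (Q1 @ Q2.T)
        l = float(min(1.0, l * (a + b * l2) / (1.0 + c * l2)))
        converged = np.linalg.norm(X_new - X, "fro") < tol
        X = X_new
        if converged:
            break
    return X

def leading_svd_via_qdwh(A, k):
    """k leading singular triplets from A = Up H and the eigendecomposition of H."""
    Up = qdwh_polar(A)
    H = Up.T @ A
    H = 0.5 * (H + H.T)                       # symmetrize against rounding errors
    w, V = np.linalg.eigh(H)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]             # indices of the k largest
    return Up @ V[:, idx], w[idx], V[:, idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((60, 20))
    U, s, V = leading_svd_via_qdwh(A, 4)
    print(np.max(np.abs(s - np.linalg.svd(A, compute_uv=False)[:4])))
```

The iteration consists only of QR factorizations and matrix multiplications, which is the kind of compute-bound kernel mix that the abstract contrasts with the memory-bound operations of PDGESVD.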
Index Terms
- High-Performance SVD Partial Spectrum Computation