DOI: 10.1145/3624062.3624248
research-article

Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators

Published: 12 November 2023

Abstract

We investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR-based Dynamically Weighted Halley (QDWH) algorithm, whose building blocks consist mainly of compute-bound matrix operations, allowing high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library. SLATE combines efficient dense linear algebra techniques, such as 2D block-cyclic data distribution and communication-avoiding algorithms, with modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and it uses OpenMP tasks to track data dependencies.
We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold speedup of the GPU-accelerated implementation over the existing state-of-the-art implementation of the polar decomposition.
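
To make the QDWH iteration named in the abstract concrete, the following is a minimal single-node NumPy sketch of the textbook QR-based iteration. It is not the SLATE implementation and does not use SLATE's API. The names `qdwh_polar`, `tol`, and `max_iter` are hypothetical, the input is assumed to be real with full column rank and at least as many rows as columns, and the 2-norm and smallest singular value are computed exactly here, whereas a production code would typically rely on cheaper estimates.

```python
import numpy as np


def qdwh_polar(A, tol=1e-10, max_iter=20):
    """Polar decomposition A = U @ H via a dense QDWH sketch.

    Assumes a real m-by-n matrix with m >= n and full column rank.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape

    # Scale so that the singular values of X lie in (0, 1].
    alpha = np.linalg.norm(A, 2)
    X = A / alpha

    # Lower bound on the smallest singular value of X (exact in this sketch).
    l = np.linalg.svd(X, compute_uv=False)[-1]
    l = min(max(l, np.finfo(float).eps), 1.0)

    for _ in range(max_iter):
        # Dynamically chosen weights a, b, c for the current bound l.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        s = np.sqrt(1.0 + d)
        a = s + 0.5 * np.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * s))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # QR factorization of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
        Q1, Q2 = Q[:m], Q[m:]

        # Inverse-free Halley update.
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)

        # Update the lower bound on the smallest singular value.
        l = min(l * (a + b * l2) / (1.0 + c * l2), 1.0)

        delta = np.linalg.norm(X_new - X, "fro")
        X = X_new
        if delta < tol:
            break

    U = X                    # orthonormal polar factor
    H = U.T @ A
    H = 0.5 * (H + H.T)      # symmetrize the positive semidefinite factor
    return U, H


if __name__ == "__main__":
    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0]])
    U, H = qdwh_polar(A)
    print("orthogonality error:", np.linalg.norm(U.T @ U - np.eye(3)))
    print("reconstruction error:", np.linalg.norm(U @ H - A))
```

Each pass is dominated by one QR factorization and one matrix product, which illustrates why the abstract describes the building blocks as compute-bound.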

Supplemental Material

MP4 File
Recording of "Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators" presentation at ScalAH'23.

      Published In

      SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
      November 2023
      2180 pages
      ISBN:9798400707858
      DOI:10.1145/3624062
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Linear algebra
      2. QDWH
      3. polar decomposition

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Exascale Computing Project

      Conference

      SC-W 2023
