Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators

Authors:

Mohammed Al Farhan,

Jack Dongarra

SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis

Pages 1680 - 1687

https://doi.org/10.1145/3624062.3624248

Published: 12 November 2023

Abstract

We investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, whose building blocks mainly consist of compute-bound matrix operations, allowing for high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library, which uses efficient techniques in dense linear algebra, such as 2D block cyclic data distribution and communication-avoiding algorithms, as well as modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and uses OpenMP tasks to track data dependencies.

We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold performance speedup of the GPU accelerated implementation compared to the existing state-of-the-art implementation for the polar decomposition.
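
The QDWH iteration mentioned in the abstract can be illustrated with a small dense NumPy sketch. This is not the SLATE implementation: it runs in a single process and ignores tiling, task scheduling, and GPUs, and the function name `qdwh_polar`, its parameters, and the stopping tolerances are illustrative assumptions. It follows the commonly published form of the QDWH recurrence, in which each step consists of one QR factorization of a stacked matrix and one matrix multiplication, the kind of compute-bound kernels the abstract refers to. For brevity, the scaling factor and the lower bound on the smallest singular value are taken from exact NumPy calls, where a production code would normally use cheaper estimates.

```python
import math

import numpy as np


def qdwh_polar(A, max_iter=20):
    """Polar decomposition A = U @ H via a dense, single-process QDWH sketch.

    U has orthonormal columns and H is Hermitian positive semidefinite.
    """
    A = np.asarray(A)
    if not np.issubdtype(A.dtype, np.inexact):
        A = A.astype(np.float64)
    m, n = A.shape
    if m < n:
        raise ValueError("this sketch expects at least as many rows as columns")

    eps = float(np.finfo(A.dtype).eps)
    alpha = float(np.linalg.norm(A, 2))  # spectral norm, used to scale A
    if alpha == 0.0:
        raise ValueError("the zero matrix has no unique polar factor")

    X = A / alpha
    # Shortcut (assumption of this sketch): exact smallest singular value
    # from an SVD instead of a cheap condition estimate.
    sigma_min = float(np.linalg.svd(X, compute_uv=False)[-1])
    l = min(max(sigma_min, eps), 1.0)  # lower bound on singular values of X

    tol = (10.0 * eps) ** (1.0 / 3.0)  # the iteration converges cubically
    eye = np.eye(n, dtype=X.dtype)
    for _ in range(max_iter):
        # Dynamically chosen weights a, b, c computed from the bound l.
        l2 = l * l
        d = (4.0 * (1.0 - l2) / (l2 * l2)) ** (1.0 / 3.0)
        a = math.sqrt(1.0 + d) + 0.5 * math.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * math.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # QR-based, inverse-free update: factor [sqrt(c) X; I] = [Q1; Q2] R.
        Q, _ = np.linalg.qr(np.vstack([math.sqrt(c) * X, eye]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + ((a - b / c) / math.sqrt(c)) * (Q1 @ Q2.conj().T)

        diff = float(np.linalg.norm(X_new - X))  # Frobenius norm of the update
        X = X_new
        l = min(l * (a + b * l2) / (1.0 + c * l2), 1.0)
        if diff <= tol and abs(1.0 - l) <= 10.0 * eps:
            break

    # Hermitian factor, symmetrized to remove rounding asymmetry.
    UhA = X.conj().T @ A
    H = 0.5 * (UhA + UhA.conj().T)
    return X, H
```

A typical check of such a routine measures the orthogonality error of `U` (the norm of `U.T @ U - I`) and the relative residual of `U @ H - A`, which corresponds to the kind of numerical accuracy reporting the abstract mentions.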

Supplemental Material

MP4 File

Recording of "Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators" presentation at ScalAH'23.


Index Terms

1. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance
    2. Solvers

Published In

SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis

November 2023

2180 pages

ISBN:9798400707858

DOI:10.1145/3624062

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Exascale Computing Project

Conference

SC-W 2023

SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis

November 12 - 17, 2023

Denver, CO, USA
