Abstract
This study focuses on optimizing the double-precision general matrix–matrix multiplication (DGEMM) routine to improve QR factorization performance. After replacing the MKL DGEMM with our previously developed blocked matrix–matrix multiplication routine, we found that QR factorization performance remained suboptimal because of a bottleneck in the \(A^{\mathrm{T}} \cdot B\) matrix–panel multiplication operation. We investigate the limitations of our matrix–matrix multiplication routine and show that its performance depends on the shape and size of the input matrices. We therefore recommend kernels tailored to the matrix shapes that arise in QR factorization and develop a new routine for the \(A^{\mathrm{T}} \cdot B\) matrix–panel multiplication operation. We demonstrate the performance of the proposed kernels in the ScaLAPACK QR factorization routine by comparing them with the MKL, OpenBLAS, and BLIS libraries. The proposed optimization yields significant performance improvements in multinode cluster environments on the Intel Xeon Phi 7250 (Knights Landing, KNL) and Intel Xeon Gold 6148 Scalable (Skylake, SKL) processors.
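To make the bottleneck operation concrete, the sketch below expresses the \(A^{\mathrm{T}} \cdot B\) matrix–panel product as a standard CBLAS DGEMM call. It only illustrates the operand shapes that arise in blocked QR factorization and is not the proposed kernel; the function name panel_update, the column-major layout, and the leading dimensions are assumptions for this example.

#include <cblas.h>

/* Sketch of C = A^T * B, the matrix-panel product that dominates the
 * trailing-matrix update in blocked QR factorization. A is an m x nb
 * panel (nb << m) and B is the m x n trailing matrix, so the output C
 * is short and wide (nb x n). Any CBLAS-compatible DGEMM
 * (MKL, OpenBLAS, BLIS) can be linked here. */
void panel_update(int m, int n, int nb,
                  const double *A,   /* m x nb, column-major, lda = m  */
                  const double *B,   /* m x n,  column-major, ldb = m  */
                  double *C)         /* nb x n, column-major, ldc = nb */
{
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                nb, n, m,            /* op(A) is nb x m, B is m x n   */
                1.0, A, m,
                     B, m,
                0.0, C, nb);
}

Because nb is small relative to m and n, the call streams through a tall, skinny panel to produce a short, wide result, which is precisely the shape regime where a DGEMM kernel tuned for large square matrices can underperform.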
Availability of data and materials
Not applicable.
Code availability
The source code associated with this paper is available in a GitHub repository.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00321688) and by the Korean National Supercomputing Center (KSC) with supercomputing resources (Nos. KSC-2022-CRE-0202 and TS-2023-RE-0036).
Author information
Contributions
MR was involved in the conceptualization, methodology, validation, writing—original draft, software, and data curation. EJ contributed to validation, methodology, software, and writing—review and editing. JC assisted in the methodology and writing—review and editing. JC contributed to the supervision.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Rizwan, M., Jung, E., Choi, J. et al. Revisiting the performance optimization of QR factorization on Intel KNL and SKL multiprocessors. J Supercomput 80, 13813–13836 (2024). https://doi.org/10.1007/s11227-024-06002-2