DOI: 10.1145/3293320.3293327
research-article

An investigation into the impact of the structured QR kernel on the overall performance of the TSQR algorithm

Published: 14 January 2019

Abstract

The TSQR algorithm is a communication-avoiding algorithm for computing the QR factorization of a tall and skinny (TS) matrix. It repeatedly executes a kernel that computes the QR factorization of a structured matrix. Although a single execution of structured QR has a small computational cost, the number of executions depends on the number of active parallel processes. The complicated computational pattern and small matrix size of structured QR make it difficult to achieve high performance, so its computational cost becomes a significant bottleneck in massively parallel computation. In this paper, we focus on the structured QR kernel and discuss its implementation. We compare several kernels, including those provided in LAPACK, on modern processors and investigate the impact of the different structured QR kernels on the overall performance of the TSQR algorithm.
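
To make the role of the structured QR kernel concrete, the following is a minimal serial sketch in Python with NumPy. It is an illustration under stated assumptions, not the implementation evaluated in the paper. It assumes a binary reduction tree and assumes the structured matrix is two n-by-n upper-triangular factors stacked on top of each other. The names `structured_qr` and `tsqr_r` are hypothetical, and the kernel here calls a generic dense QR instead of exploiting the triangular structure, which is exactly the part a tuned kernel would replace.

```python
import numpy as np


def structured_qr(r_top, r_bottom):
    """Return the R factor of two stacked n-by-n upper-triangular matrices.

    Assumption: this is the "structured matrix" of the reduction step.
    A tuned kernel would skip the known zeros below each diagonal;
    this sketch simply calls a generic dense QR on the 2n-by-n stack.
    """
    stacked = np.vstack([r_top, r_bottom])
    return np.linalg.qr(stacked, mode="r")


def tsqr_r(a, num_blocks):
    """Return the R factor of a tall-skinny matrix via a binary reduction tree.

    Each row block stands in for the data owned by one parallel process.
    Every block is assumed to have at least as many rows as a has columns.
    """
    blocks = np.array_split(a, num_blocks, axis=0)

    # Leaf step: an independent local QR on each block.
    factors = [np.linalg.qr(block, mode="r") for block in blocks]

    # Reduction steps: combine pairs of R factors with the structured kernel.
    while len(factors) > 1:
        combined = []
        for i in range(0, len(factors) - 1, 2):
            combined.append(structured_qr(factors[i], factors[i + 1]))
        if len(factors) % 2 == 1:
            # An unpaired factor is carried up to the next tree level.
            combined.append(factors[-1])
        factors = combined
    return factors[0]
```

Calling `tsqr_r(a, 8)` on a 1000-by-8 matrix runs eight local factorizations followed by three levels of structured QR. In this sketch the tree depth grows with the number of blocks while each structured QR stays at size 2n-by-n, which matches the abstract's point that the kernel is small but executed repeatedly.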

Cited By

  • (2022) Distributed Parallel Tall-Skinny QR Factorization: Performance Evaluation of Various Algorithms on Various Systems. Parallel and Distributed Computing, Applications and Technologies, 275--287. DOI: 10.1007/978-3-031-29927-8_22. Online publication date: 7-Dec-2022

    Information & Contributors

    Information

    Published In

    HPCAsia '19: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
    January 2019
    143 pages
    ISBN:9781450366328
    DOI:10.1145/3293320

    In-Cooperation

    • Sun Yat-Sen University
    • CCF: China Computer Federation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. TSQR algorithm
    2. communication-avoiding
    3. dense linear algebra
    4. implementation for small size problem
    5. performance evaluation
    6. tall-skinny QR

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    HPC Asia 2019

    Acceptance Rates

    HPCAsia '19 Paper Acceptance Rate 15 of 32 submissions, 47%;
    Overall Acceptance Rate 69 of 143 submissions, 48%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (last 12 months): 8
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 20 Feb 2025
