DOI: 10.1145/2486159.2486198
research-article

Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Published: 23 July 2013

Abstract

High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.
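
The layout tension described in the abstract can be illustrated with a small sketch. It compares the memory offsets of a single column (the unit that a pivot search scans) and of a square sub-block (the unit that blocked updates work on) under two layouts. Morton (Z-order) indexing is used here as one common block-recursive layout; that choice, the function names, and the 4-by-4 size are assumptions made for illustration, and the code is not the paper's shape morphing procedure.

```python
def col_major_offset(i, j, n):
    """Offset of entry (i, j) in an n-by-n matrix stored column by column."""
    return j * n + i


def morton_offset(i, j):
    """Offset of entry (i, j) in Morton (Z-order) layout.

    The bits of the column index go to even positions and the bits of the
    row index go to odd positions (an assumed, commonly used convention).
    """
    offset = 0
    bit = 0
    while (i >> bit) or (j >> bit):
        offset |= ((j >> bit) & 1) << (2 * bit)
        offset |= ((i >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return offset


def offsets(layout, cells):
    """Sorted memory offsets touched when visiting the given (i, j) cells."""
    return sorted(layout(i, j) for i, j in cells)


if __name__ == "__main__":
    n = 4
    column = [(i, 1) for i in range(n)]               # all rows of column 1
    block = [(i, j) for i in (2, 3) for j in (2, 3)]  # bottom-right 2x2 block

    def cm(i, j):
        return col_major_offset(i, j, n)

    # A column is contiguous in column-major order but scattered in Morton order.
    print("column, column-major:", offsets(cm, column))             # [4, 5, 6, 7]
    print("column, Morton:      ", offsets(morton_offset, column))  # [1, 3, 9, 11]

    # An aligned square block is contiguous in Morton order but not in column-major.
    print("block, column-major: ", offsets(cm, block))              # [10, 11, 14, 15]
    print("block, Morton:       ", offsets(morton_offset, block))   # [12, 13, 14, 15]
```

Scanning a column for a pivot touches consecutive offsets only under the column-major layout, while an aligned square block is consecutive only under the Morton layout. This is the mismatch that motivates changing the layout as the computation proceeds.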


Cited By

  • (2017) Energy Avoiding Matrix Multiply. Languages and Compilers for Parallel Computing, pages 55-70. DOI: 10.1007/978-3-319-52709-3_5. Online publication date: 24-Jan-2017
  • (2017) Introduction to Communication Avoiding Algorithms for Direct Methods of Factorization in Linear Algebra. Computational Mathematics, Numerical Analysis and Applications, pages 153-185. DOI: 10.1007/978-3-319-49631-3_4. Online publication date: 5-Aug-2017
  • (2015) Avoiding Communication in Successive Band Reduction. ACM Transactions on Parallel Computing, 1:2, pages 1-37. DOI: 10.1145/2686877. Online publication date: 18-Feb-2015
  • (2014) Is multicore hardware for general-purpose parallel processing broken? Communications of the ACM, 57:4, pages 35-39. DOI: 10.1145/2580945. Online publication date: 1-Apr-2014


    Published In

    SPAA '13: Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
    July 2013
    348 pages
    ISBN:9781450315722
    DOI:10.1145/2486159
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 July 2013

    Author Tags

    1. cache oblivious algorithms
    2. communication-avoiding algorithms
    3. matrix data layouts
    4. matrix factorization

    Qualifiers

    • Research-article

    Conference

    SPAA '13

    Acceptance Rates

    SPAA '13 Paper Acceptance Rate 31 of 130 submissions, 24%;
    Overall Acceptance Rate 447 of 1,461 submissions, 31%

