research-article

Elemental: A New Framework for Distributed Memory Dense Matrix Computations

Published: 01 February 2013

Abstract

Parallelizing dense matrix computations for distributed memory architectures is a well-studied subject and is generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid-1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited, since the traditional MPI-based approaches will likely need to be extended. Thus, this is a good time to review lessons learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show that the new solution achieves competitive, if not superior, performance on large clusters.


    Published In

    ACM Transactions on Mathematical Software  Volume 39, Issue 2
    February 2013
    151 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/2427023
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 February 2013
    Accepted: 01 February 2012
    Revised: 01 January 2012
    Received: 01 September 2010
    Published in TOMS Volume 39, Issue 2

    Author Tags

    1. Linear algebra
    2. high-performance
    3. libraries
    4. parallel computing

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2025) Evolution of the SLATE linear algebra library. International Journal of High Performance Computing Applications 39:1 (3-17). DOI: 10.1177/10943420241286531. Online publication date: 1-Jan-2025
    • (2023) Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (1680-1687). DOI: 10.1145/3624062.3624248. Online publication date: 12-Nov-2023
    • (2023) Library Development with MPI: Attributes, Request Objects, Group Communicator Creation, Local Reductions, and Datatypes. Proceedings of the 30th European MPI Users' Group Meeting (1-10). DOI: 10.1145/3615318.3615323. Online publication date: 11-Sep-2023
    • (2023) O(N) distributed direct factorization of structured dense matrices using runtime systems. Proceedings of the 52nd International Conference on Parallel Processing (1-10). DOI: 10.1145/3605573.3605606. Online publication date: 7-Aug-2023
    • (2023) Parallel Memory-Independent Communication Bounds for SYRK. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (391-401). DOI: 10.1145/3558481.3591072. Online publication date: 17-Jun-2023
    • (2023) Inversion of Eddy-Current Signals Using a Level-Set Method and Block Krylov Solvers. SIAM Journal on Scientific Computing 45:3 (B366-B389). DOI: 10.1137/20M1382064. Online publication date: 12-Jun-2023
    • (2023) Memristor-Based Spectral Decomposition of Matrices and Its Applications. IEEE Transactions on Computers 72:5 (1460-1472). DOI: 10.1109/TC.2022.3202746. Online publication date: 1-May-2023
    • (2023) An Empirical Study of High Performance Computing (HPC) Performance Bugs. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR) (194-206). DOI: 10.1109/MSR59073.2023.00037. Online publication date: May-2023
    • (2023) On the Arithmetic Intensity of Distributed-Memory Dense Matrix Multiplication Involving a Symmetric Input Matrix (SYMM). 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (357-367). DOI: 10.1109/IPDPS54959.2023.00044. Online publication date: May-2023
    • (2023) PAQR: Pivoting Avoiding QR factorization. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (322-332). DOI: 10.1109/IPDPS54959.2023.00040. Online publication date: May-2023
