Elemental: A New Framework for Distributed Memory Dense Matrix Computations

Authors:

Robert A. van de Geijn,

Jeff R. Hammond,

Nichols A. Romero

ACM Transactions on Mathematical Software (TOMS), Volume 39, Issue 2

Article No.: 13, Pages 1 - 24

https://doi.org/10.1145/2427023.2427030

Published: 01 February 2013

Abstract

Parallelizing dense matrix computations for distributed memory architectures is a well-studied subject and is generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid-1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited, since the traditional MPI-based approaches will likely need to be extended. Thus, this is a good time to review the lessons learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show that the new solution achieves competitive, if not superior, performance on large clusters.

Cited By

Heroux M, Gates M, Abdelfattah A, Akbudak K, Al Farhan M, Alomairy R, Bielich D, Burgess T, Cayrols S, Lindquist N, Sukkari D, YarKhan A (2025) Evolution of the SLATE linear algebra library. International Journal of High Performance Computing Applications 39:1 (3-17). DOI: 10.1177/10943420241286531. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1177/10943420241286531
Sukkari D, Gates M, Al Farhan M, Anzt H, Dongarra J (2023) Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (1680-1687). DOI: 10.1145/3624062.3624248. Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624248
Träff J, Vardas I (2023) Library Development with MPI: Attributes, Request Objects, Group Communicator Creation, Local Reductions, and Datatypes. Proceedings of the 30th European MPI Users' Group Meeting (1-10). DOI: 10.1145/3615318.3615323. Online publication date: 11-Sep-2023
https://dl.acm.org/doi/10.1145/3615318.3615323

Index Terms

Elemental: A New Framework for Distributed Memory Dense Matrix Computations
1. Mathematics of computing
  1. Mathematical software

Published In

ACM Transactions on Mathematical Software Volume 39, Issue 2

February 2013

151 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/2427023

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2013

Accepted: 01 February 2012

Revised: 01 January 2012

Received: 01 September 2010

Published in TOMS Volume 39, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Microsoft
Intel Corporation
Argonne National Laboratory, Office of Science
Institute of Computational Engineering and Sciences
U.S. Department of Energy
Office of Cyberinfrastructure
Texas A and M University
Division of Computing and Communication Foundations
Sandia National Laboratories, National Nuclear Security Administration
