A high performance matrix multiplication algorithm for MPPs

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1041)

Abstract

A 3-dimensional (3-D) matrix multiplication algorithm for massively parallel processing systems is presented. Performing the product of two matrices, C = βC + αAB, is viewed as solving a 2-dimensional problem in the 3-dimensional computational space. The three dimensions correspond to the matrix dimensions m, k, and n: A ∈ R^{m×k}, B ∈ R^{k×n}, and C ∈ R^{m×n}. The p processors are configured as a "virtual" processing cube with dimensions p_1, p_2, and p_3, proportional to the matrix dimensions m, n, and k. Each processor performs a local matrix multiplication of size m/p_1 × n/p_2 × k/p_3 on one of the sub-cubes in the computational space. Before the local computation can be carried out, each sub-cube needs to receive the sub-matrices corresponding to the planes where A and B reside. After this single local matrix multiplication has completed, the sub-matrices of C have to be reassigned to their respective processors. To the best of our knowledge, the 3-D parallel matrix multiplication approach requires the least amount of communication among all known parallel algorithms for matrix multiplication. Furthermore, performing a single local matrix multiplication extracts the best possible performance from the uni-processor matrix multiply routine. The 3-D approach achieves high performance even for relatively small matrices and/or a large number of processors (massively parallel). The algorithm has been implemented on IBM POWERparallel SP2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. For large matrices, the algorithm can be combined with Winograd's variant of Strassen's algorithm to achieve "super-linear" speed-up; when the Winograd approach is used, the performance achieved per processor exceeds the theoretical peak of the system.
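
To make the decomposition concrete, here is a minimal serial sketch in NumPy of the scheme the abstract describes, under the assumption that m, n, and k are divisible by p_1, p_2, and p_3. The triple loop over (i, j, l) stands in for the p_1 × p_2 × p_3 processor cube, and the sum over l stands in for the reduction that reassigns the sub-matrices of C. All names (matmul_3d, p1, p2, p3) are illustrative; the authors' actual implementation ran under MPI on the SP2, not in Python.

```python
import numpy as np

def matmul_3d(A, B, C, alpha=1.0, beta=1.0, p1=2, p2=2, p3=2):
    """Serial simulation of the 3-D algorithm: C = beta*C + alpha*A@B."""
    m, k = A.shape
    _, n = B.shape
    assert m % p1 == 0 and n % p2 == 0 and k % p3 == 0
    mb, nb, kb = m // p1, n // p2, k // p3

    # Each "processor" (i, j, l) owns one sub-cube of the m x n x k
    # computational space and performs a single local GEMM on the
    # A and B planes it would receive in the parallel version.
    partial = np.zeros((p1, p2, p3, mb, nb))
    for i in range(p1):
        for j in range(p2):
            for l in range(p3):
                A_il = A[i*mb:(i+1)*mb, l*kb:(l+1)*kb]
                B_lj = B[l*kb:(l+1)*kb, j*nb:(j+1)*nb]
                partial[i, j, l] = A_il @ B_lj

    # Summing over l plays the role of the reduction that reassigns
    # the sub-matrices of C to their owning processors.
    for i in range(p1):
        for j in range(p2):
            rows = slice(i*mb, (i+1)*mb)
            cols = slice(j*nb, (j+1)*nb)
            C[rows, cols] = beta * C[rows, cols] + alpha * partial[i, j].sum(axis=0)
    return C

# Quick check against NumPy's own product.
rng = np.random.default_rng(0)
A, B = rng.random((4, 8)), rng.random((8, 6))
C = rng.random((4, 6))
expected = 0.5 * C + 2.0 * (A @ B)
assert np.allclose(matmul_3d(A, B, C.copy(), alpha=2.0, beta=0.5), expected)
```

In the parallel algorithm, moving the A and B planes into the sub-cubes and summing the C contributions are the only communication steps, which is why the communication volume is so low; the l-sum above would correspond to a reduction along the third dimension of the processor cube.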

References

  1. R. C. Agarwal, F. G. Gustavson, S. M. Balle, M. Joshi, and P. Palkar. A 3-dimensional approach to parallel matrix multiplication. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, 1995. Under preparation.

  2. R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high-performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM Journal of Research and Development, pages 673–681, 1994.

  3. J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. Technical report, University of Tennessee, 1992.

  4. J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers. Technical report, University of Tennessee, 1994.

  5. J. W. Demmel, M. T. Heath, and H. A. van der Vorst. Parallel numerical linear algebra. In Acta Numerica 1993, pages 111–197. Cambridge University Press, 1993.

  6. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1995.

  7. H. Franke, C. E. Wu, M. Riviere, P. Pattnaik, and M. Snir. MPI programming environment for IBM SP1/SP2. Technical report, IBM T. J. Watson Research Center, 1995.

  8. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

  9. A. Gupta and V. Kumar. Scalability of parallel algorithms for matrix multiplication. Technical report, Department of Computer Science, University of Minnesota, 1991. Revised April 1994.

  10. N. J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Trans. Math. Software, 16:352–368, 1990.

  11. A. Ho. Personal communication. IBM Almaden, 1995.

  12. IBM. Engineering and Scientific Subroutine Library, Guide and Reference: SC23-0526-01. IBM, 1994.

  13. IBM. Scalable parallel computing. IBM Systems Journal, 34(2), 1995.

  14. S. L. Johnsson and C.-T. Ho. Algorithms for multiplying matrices of arbitrary shapes using shared memory primitives on Boolean cubes. Technical Report TR-569, Yale University, 1987.

  15. V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969.

  16. R. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Technical report, Department of Computer Science, University of Texas at Austin, 1995.

Editor information

Jack Dongarra, Kaj Madsen, Jerzy Waśniewski

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Agarwal, R.C., Gustavson, F.G., Balle, S.M., Joshi, M., Palkar, P. (1996). A high performance matrix multiplication algorithm for MPPs. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds) Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science. PARA 1995. Lecture Notes in Computer Science, vol 1041. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60902-4_1

  • DOI: https://doi.org/10.1007/3-540-60902-4_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60902-5

  • Online ISBN: 978-3-540-49670-0

  • eBook Packages: Springer Book Archive
