A high performance matrix multiplication algorithm for MPPs

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1041)

Abstract

A 3-dimensional (3-D) matrix multiplication algorithm for massively parallel processing systems is presented. Performing the product of two matrices, C = βC + αAB, is viewed as solving a 2-dimensional problem in the 3-dimensional computational space. The three dimensions correspond to the matrix dimensions m, k, and n: A ∈ R^{m×k}, B ∈ R^{k×n}, and C ∈ R^{m×n}. The p processors are configured as a "virtual" processing cube with dimensions p_1, p_2, and p_3, proportional to the matrix dimensions m, n, and k. Each processor performs a local matrix multiplication of size m/p_1 × n/p_2 × k/p_3 on one of the sub-cubes in the computational space. Before the local computation can be carried out, each sub-cube needs to receive the sub-matrices corresponding to the planes where A and B reside. After this single local matrix multiplication has completed, the sub-matrices of C have to be reassigned to their respective processors. To the best of our knowledge, the 3-D parallel matrix multiplication approach requires the least amount of communication among all known parallel algorithms for matrix multiplication. Furthermore, performing a single local matrix multiplication extracts the best possible performance from the uni-processor matrix multiply routine. The 3-D approach achieves high performance even for relatively small matrices and/or a large number of processors (massively parallel). The algorithm has been implemented on IBM POWERparallel SP2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. For large matrices, the algorithm can be combined with Winograd's variant of Strassen's algorithm to achieve "super-linear" speed-up; when the Winograd approach is used, the performance achieved per processor exceeds the theoretical peak of the system.
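
To make the decomposition concrete, here is a minimal serial sketch in NumPy of the scheme the abstract describes, under the assumption that m, n, and k are divisible by p_1, p_2, and p_3. The triple loop over (i, j, l) stands in for the p_1 × p_2 × p_3 processor cube, and the sum over l stands in for the reduction that reassigns the sub-matrices of C. All names (matmul_3d, p1, p2, p3) are illustrative; the authors' actual implementation ran under MPI on the SP2, not in Python.

```python
import numpy as np

def matmul_3d(A, B, C, alpha=1.0, beta=1.0, p1=2, p2=2, p3=2):
    """Serial simulation of the 3-D algorithm: C = beta*C + alpha*A@B."""
    m, k = A.shape
    _, n = B.shape
    assert m % p1 == 0 and n % p2 == 0 and k % p3 == 0
    mb, nb, kb = m // p1, n // p2, k // p3

    # Each "processor" (i, j, l) owns one sub-cube of the m x n x k
    # computational space and performs a single local GEMM on the
    # A and B planes it would receive in the parallel version.
    partial = np.zeros((p1, p2, p3, mb, nb))
    for i in range(p1):
        for j in range(p2):
            for l in range(p3):
                A_il = A[i*mb:(i+1)*mb, l*kb:(l+1)*kb]
                B_lj = B[l*kb:(l+1)*kb, j*nb:(j+1)*nb]
                partial[i, j, l] = A_il @ B_lj

    # Summing over l plays the role of the reduction that reassigns
    # the sub-matrices of C to their owning processors.
    for i in range(p1):
        for j in range(p2):
            rows = slice(i*mb, (i+1)*mb)
            cols = slice(j*nb, (j+1)*nb)
            C[rows, cols] = beta * C[rows, cols] + alpha * partial[i, j].sum(axis=0)
    return C

# Quick check against NumPy's own product.
rng = np.random.default_rng(0)
A, B = rng.random((4, 8)), rng.random((8, 6))
C = rng.random((4, 6))
expected = 0.5 * C + 2.0 * (A @ B)
assert np.allclose(matmul_3d(A, B, C.copy(), alpha=2.0, beta=0.5), expected)
```

In the parallel algorithm, moving the A and B planes into the sub-cubes and summing the C contributions are the only communication steps, which is why the communication volume is so low; the l-sum above would correspond to a reduction along the third dimension of the processor cube.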

References

  1. R. C. Agarwal, F. G. Gustavson, S. M. Balle, M. Joshi, and P. Palkar. A 3-dimensional approach to parallel matrix multiplication. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, 1995. Under preparation.

  2. R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high-performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM Journal of Research and Development, pages 673–681, 1994.

  3. J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. Technical report, University of Tennessee, 1992.

  4. J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers. Technical report, University of Tennessee, 1994.

  5. J. W. Demmel, M. T. Heath, and H. A. van der Vorst. Parallel numerical linear algebra. In Acta Numerica 1993, pages 111–197. Cambridge University Press, 1993.

  6. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1995.

  7. H. Franke, C. E. Wu, M. Riviere, P. Pattnaik, and M. Snir. MPI programming environment for IBM SP1/SP2. Technical report, IBM T. J. Watson Research Center, 1995.

  8. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

  9. A. Gupta and V. Kumar. Scalability of parallel algorithms for matrix multiplication. Technical report, Department of Computer Science, University of Minnesota, 1991. Revised April 1994.

  10. N. J. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Trans. Math. Software, 16:352–368, 1990.

  11. A. Ho. Personal communication. IBM Almaden, 1995.

  12. IBM. Engineering and Scientific Subroutine Library, Guide and Reference: SC23-0526-01. IBM, 1994.

  13. IBM. Scalable parallel computing. IBM Systems Journal, 34(2), 1995.

  14. S. L. Johnsson and C.-T. Ho. Algorithms for multiplying matrices of arbitrary shapes using shared memory primitives on Boolean cubes. Technical Report TR-569, Yale University, 1987.

  15. V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969.

  16. R. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Technical report, Department of Computer Science, University of Texas at Austin, 1995.

Editor information

Jack Dongarra, Kaj Madsen, Jerzy Waśniewski

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Agarwal, R.C., Gustavson, F.G., Balle, S.M., Joshi, M., Palkar, P. (1996). A high performance matrix multiplication algorithm for MPPs. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds) Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science. PARA 1995. Lecture Notes in Computer Science, vol 1041. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60902-4_1

  • DOI: https://doi.org/10.1007/3-540-60902-4_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60902-5

  • Online ISBN: 978-3-540-49670-0

  • eBook Packages: Springer Book Archive
