Distributed general matrix multiply and add for a 2D mesh processor network

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1041)

Abstract

A distributed algorithm with the same functionality as the single-processor level 3 BLAS operation GEMM (general matrix multiply and add) is presented. By the same functionality we mean the ability to perform GEMM operations on arbitrary subarrays of the matrices involved. The logical network is a 2D square mesh with torus connectivity. The matrices involved are distributed using a non-scattered blocked data distribution. The algorithm consists of two main parts: alignment and data movement of the subarrays involved in the operation, and a distributed blocked matrix multiplication algorithm on (sub)matrices that uses only a square submesh. This general approach makes it possible to perform GEMM operations on non-overlapping submeshes simultaneously.
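
As an illustration only, the two parts named in the abstract (an alignment step followed by a blocked multiplication on a square torus mesh) can be sketched in a single process. The abstract does not say which multiplication scheme the paper uses; the sketch below substitutes Cannon's algorithm, a standard technique for torus-connected meshes, and should not be read as the authors' method. All function names (`split_blocks`, `join_blocks`, `local_gemm`, `cannon_gemm`) are hypothetical, and the sketch assumes full square matrices whose order is divisible by the mesh side `p`, with no transposes and no subarray selection.

```python
def split_blocks(M, p):
    """Partition an n x n matrix into a p x p grid of contiguous blocks
    (one block per simulated processor: a non-scattered blocked layout)."""
    b = len(M) // p
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(p)] for i in range(p)]


def join_blocks(blocks):
    """Reassemble a p x p grid of blocks into one matrix (list of rows)."""
    p, b = len(blocks), len(blocks[0][0])
    return [sum((blocks[i][j][r] for j in range(p)), [])
            for i in range(p) for r in range(b)]


def local_gemm(acc, a, b):
    """acc += a * b for small dense blocks stored as lists of lists."""
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            for j in range(len(b[0])):
                acc[i][j] += aik * b[k][j]


def cannon_gemm(alpha, A, B, beta, C, p):
    """Return alpha*A*B + beta*C, simulating Cannon's algorithm on a
    p x p torus. Assumption: A, B, C are n x n with n divisible by p."""
    n = len(A)
    assert n % p == 0, "matrix order must be divisible by the mesh side"
    bs = n // p
    a, b, c = split_blocks(A, p), split_blocks(B, p), split_blocks(C, p)

    # Alignment: row i of A is shifted left by i, column j of B up by j,
    # so processor (i, j) starts with A block (i, i+j) and B block (i+j, j).
    a = [[a[i][(i + j) % p] for j in range(p)] for i in range(p)]
    b = [[b[(i + j) % p][j] for j in range(p)] for i in range(p)]

    acc = [[[[0.0] * bs for _ in range(bs)] for _ in range(p)]
           for _ in range(p)]
    for _ in range(p):
        # Each processor multiplies the two blocks it currently holds.
        for i in range(p):
            for j in range(p):
                local_gemm(acc[i][j], a[i][j], b[i][j])
        # Torus shifts: A blocks move one step left, B blocks one step up.
        a = [[a[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        b = [[b[(i + 1) % p][j] for j in range(p)] for i in range(p)]

    # Apply the GEMM scaling locally on each processor's block of C.
    out = [[[[alpha * acc[i][j][r][s] + beta * c[i][j][r][s]
              for s in range(bs)] for r in range(bs)]
            for j in range(p)] for i in range(p)]
    return join_blocks(out)
```

Calling `cannon_gemm(alpha, A, B, beta, C, p)` with 4 x 4 lists of floats and `p = 2` simulates a 2 x 2 torus in which each processor owns one 2 x 2 block. A real distributed version would replace the list re-indexing in the two shift steps with neighbor-to-neighbor messages.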

Editor information

Jack Dongarra, Kaj Madsen, Jerzy Waśniewski

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kågström, B., Rännar, M. (1996). Distributed general matrix multiply and add for a 2D mesh processor network. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds) Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science. PARA 1995. Lecture Notes in Computer Science, vol 1041. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60902-4_36

  • DOI: https://doi.org/10.1007/3-540-60902-4_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60902-5

  • Online ISBN: 978-3-540-49670-0

  • eBook Packages: Springer Book Archive
