Abstract
Effective design of parallel matrix multiplication algorithms requires attention to many interdependent issues arising from the underlying parallel machine or network on which the algorithms will run, as well as from the methodology each algorithm employs. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm, known or unknown, for this problem; moreover, any algorithm that claims to be optimal must attain this run-time. To obtain results that are general and useful across a span of machines, we base our analysis on the well-known LogP model. Three criteria determine the running time of a parallel algorithm: (i) the local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We establish optimality by first proving general lower bounds on parallel run-time. These lower bounds lead to significant insights on (i)–(iii) above; in particular, we present which data layouts and communication schedules are needed to obtain optimal run-times. We prove that no single data layout achieves optimal running time in all cases. Instead, the optimal layout depends on the dimensions of each matrix and on the number of processors. Lastly, we provide optimal algorithms.
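The role of the initial data layout, criterion (ii) above, can be illustrated with a toy sequential simulation. This is not the paper's algorithm, only a minimal sketch under one assumed layout — A split into 1D row blocks (one block per processor) with B fully replicated — and all function names here are invented for illustration:

```python
# Toy illustration of one data layout for C = A * B on p "processors":
# processor r owns a contiguous block of rows of A plus a full copy of B,
# so it can compute its block of rows of C with no communication at all.
# This layout is an assumption for illustration, not the paper's optimal one.

def matmul(A, B):
    """Plain serial matrix product, used as the reference result."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def row_block_matmul(A, B, p):
    """Simulate p processors, each multiplying its row block of A by B."""
    n = len(A)
    # Block boundaries: processor r owns rows [r*n//p, (r+1)*n//p).
    bounds = [(r * n // p, (r + 1) * n // p) for r in range(p)]
    C = []
    for lo, hi in bounds:  # each "processor" computes its rows of C locally
        C.extend(matmul(A[lo:hi], B))
    return C

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
assert row_block_matmul(A, B, 2) == matmul(A, B)
```

Replicating B avoids communication entirely, but only at the cost of memory proportional to the full matrix on every processor; the layouts analyzed in the paper trade this off against communication cost under the LogP parameters.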
Santos, E.E. Parallel Complexity of Matrix Multiplication. The Journal of Supercomputing 25, 155–175 (2003). https://doi.org/10.1023/A:1023996628662