Abstract
Effective design of parallel matrix multiplication algorithms requires attention to many interdependent issues arising from the underlying parallel machine or network on which the algorithms will run, as well as from the methodology each algorithm employs. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm, known or unknown, for this problem; moreover, any algorithm that claims to be optimal must attain this run-time. To obtain results that are general and useful across a span of machines, we base our analysis on the well-known LogP model. Three criteria determine the running time of a parallel algorithm: (i) the local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We establish optimality by first proving general lower bounds on parallel run-time. These lower bounds lead to significant insights on (i)–(iii) above; in particular, we present which data layouts and communication schedules are needed to obtain optimal run-times. We prove that no single data layout achieves optimal running time in all cases. Instead, the optimal layout depends on the dimensions of each matrix and on the number of processors. Lastly, we provide optimal algorithms.
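The role of the initial data layout, criterion (ii) above, can be illustrated with a toy sequential simulation. This is not the paper's algorithm, only a minimal sketch under one assumed layout — A split into 1D row blocks (one block per processor) with B fully replicated — and all function names here are invented for illustration:

```python
# Toy illustration of one data layout for C = A * B on p "processors":
# processor r owns a contiguous block of rows of A plus a full copy of B,
# so it can compute its block of rows of C with no communication at all.
# This layout is an assumption for illustration, not the paper's optimal one.

def matmul(A, B):
    """Plain serial matrix product, used as the reference result."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def row_block_matmul(A, B, p):
    """Simulate p processors, each multiplying its row block of A by B."""
    n = len(A)
    # Block boundaries: processor r owns rows [r*n//p, (r+1)*n//p).
    bounds = [(r * n // p, (r + 1) * n // p) for r in range(p)]
    C = []
    for lo, hi in bounds:  # each "processor" computes its rows of C locally
        C.extend(matmul(A[lo:hi], B))
    return C

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
assert row_block_matmul(A, B, 2) == matmul(A, B)
```

Replicating B avoids communication entirely, but only at the cost of memory proportional to the full matrix on every processor; the layouts analyzed in the paper trade this off against communication cost under the LogP parameters.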
Santos, E.E. Parallel Complexity of Matrix Multiplication. The Journal of Supercomputing 25, 155–175 (2003). https://doi.org/10.1023/A:1023996628662