
Communication costs of Strassen's matrix multiplication

Published: 01 February 2014

Abstract

Algorithms have historically been evaluated in terms of the number of arithmetic operations they performed. This analysis is no longer sufficient for predicting running times on today's machines. Moving data through memory hierarchies and among processors requires much more time (and energy) than performing computations. Hardware trends suggest that the relative costs of this communication will only increase. Proving lower bounds on the communication of algorithms and finding algorithms that attain these bounds are therefore fundamental goals. We show that the communication cost of an algorithm is closely related to the graph expansion properties of its corresponding computation graph.
Matrix multiplication is one of the most fundamental problems in scientific computing and in parallel computing. Applying expansion analysis to Strassen's and other fast matrix multiplication algorithms, we obtain the first lower bounds on their communication costs. These bounds show that the current sequential algorithms are optimal but that previous parallel algorithms communicate more than necessary. Our new parallelization of Strassen's algorithm is communication-optimal and outperforms all previous matrix multiplication algorithms.
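As background for the abstract above: Strassen's method multiplies two n-by-n matrices with 7 (rather than 8) recursive half-size products, giving O(n^log2 7) arithmetic. The sketch below is a minimal sequential version in Python, assuming n is a power of two; it is an illustration only, not the paper's communication-optimal parallel algorithm.

```python
# Minimal sequential Strassen sketch (illustration; not the paper's
# communication-optimal parallel algorithm). Assumes n is a power of two.

def _add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _split(A):
    # Return the four n/2-by-n/2 quadrants A11, A12, A21, A22.
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def strassen(A, B):
    """Multiply two n-by-n matrices using 7 recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    # Strassen's 7 products of half-size matrices:
    M1 = strassen(_add(A11, A22), _add(B11, B22))
    M2 = strassen(_add(A21, A22), B11)
    M3 = strassen(A11, _sub(B12, B22))
    M4 = strassen(A22, _sub(B21, B11))
    M5 = strassen(_add(A11, A12), B22)
    M6 = strassen(_sub(A21, A11), _add(B11, B12))
    M7 = strassen(_sub(A12, A22), _add(B21, B22))
    # Recombine into the four quadrants of C = A*B:
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_sub(_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

# 2x2 check against the classical product:
print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The recursion does 7 multiplications at the cost of 18 half-size additions/subtractions; the communication analysis in the paper concerns how the operands of these recursive products move through the memory hierarchy and between processors, which this sequential sketch does not model.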




Published In

Communications of the ACM, Volume 57, Issue 2 (February 2014), 103 pages
ISSN: 0001-0782, EISSN: 1557-7317
DOI: 10.1145/2556647
Editor: Moshe Y. Vardi

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Short-paper
  • Research
  • Refereed

