
Communication costs of Strassen's matrix multiplication

Published: 01 February 2014

Abstract

Algorithms have historically been evaluated in terms of the number of arithmetic operations they performed. This analysis is no longer sufficient for predicting running times on today's machines. Moving data through memory hierarchies and among processors requires much more time (and energy) than performing computations. Hardware trends suggest that the relative costs of this communication will only increase. Proving lower bounds on the communication of algorithms and finding algorithms that attain these bounds are therefore fundamental goals. We show that the communication cost of an algorithm is closely related to the graph expansion properties of its corresponding computation graph.
Matrix multiplication is one of the most fundamental problems in scientific computing and in parallel computing. Applying expansion analysis to Strassen's and other fast matrix multiplication algorithms, we obtain the first lower bounds on their communication costs. These bounds show that the current sequential algorithms are optimal but that previous parallel algorithms communicate more than necessary. Our new parallelization of Strassen's algorithm is communication-optimal and outperforms all previous matrix multiplication algorithms.
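As background for the abstract above: Strassen's method multiplies two n-by-n matrices with 7 (rather than 8) recursive half-size products, giving O(n^log2 7) arithmetic. The sketch below is a minimal sequential version in Python, assuming n is a power of two; it is an illustration only, not the paper's communication-optimal parallel algorithm.

```python
# Minimal sequential Strassen sketch (illustration; not the paper's
# communication-optimal parallel algorithm). Assumes n is a power of two.

def _add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _split(A):
    # Return the four n/2-by-n/2 quadrants A11, A12, A21, A22.
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def strassen(A, B):
    """Multiply two n-by-n matrices using 7 recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    # Strassen's 7 products of half-size matrices:
    M1 = strassen(_add(A11, A22), _add(B11, B22))
    M2 = strassen(_add(A21, A22), B11)
    M3 = strassen(A11, _sub(B12, B22))
    M4 = strassen(A22, _sub(B21, B11))
    M5 = strassen(_add(A11, A12), B22)
    M6 = strassen(_sub(A21, A11), _add(B11, B12))
    M7 = strassen(_sub(A12, A22), _add(B21, B22))
    # Recombine into the four quadrants of C = A*B:
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_sub(_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

# 2x2 check against the classical product:
print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The recursion does 7 multiplications at the cost of 18 half-size additions/subtractions; the communication analysis in the paper concerns how the operands of these recursive products move through the memory hierarchy and between processors, which this sequential sketch does not model.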




Published In

Communications of the ACM, Volume 57, Issue 2 (February 2014), 103 pages
ISSN: 0001-0782, EISSN: 1557-7317
DOI: 10.1145/2556647
Editor: Moshe Y. Vardi

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Short-paper
  • Research
  • Refereed

