ABSTRACT
Classic cache-oblivious parallel matrix multiplication algorithms achieve optimality in either time or space, but not both, which has motivated extensive research on the best achievable trade-off between the two. We study modern processor-oblivious runtime systems and identify several ways to improve an algorithm's time complexity while keeping its space and cache requirements asymptotically optimal. Building on this study, we present sub-linear-time algorithms with optimal work, space, and cache complexity, both for general matrix multiplication over a semiring and for Strassen-like fast algorithms over a ring. Our experiments show that these algorithms have empirical advantages over their classic counterparts. Our study offers new insights and research directions for optimizing cache-oblivious parallel algorithms from both theoretical and empirical perspectives.
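To make the setting concrete, the following is a minimal sketch of the classic cache-oblivious divide-and-conquer scheme for matrix multiplication over a semiring, which the paper's algorithms refine. It is not the paper's algorithm: it is the textbook eight-way recursion, written sequentially, assuming square power-of-two matrices for brevity. The `threshold` base-case size is a hypothetical tuning parameter; a real implementation would call an optimized kernel there.

```python
import numpy as np

def co_matmul(A, B, threshold=64):
    """Cache-oblivious divide-and-conquer matrix multiply (illustrative sketch).

    Recursively halves the problem so that subproblems eventually fit in
    every level of the cache hierarchy without knowing cache parameters.
    Assumes square matrices with power-of-two dimension.
    """
    n = A.shape[0]
    if n <= threshold:
        return A @ B  # base-case kernel (here: library multiply)
    h = n // 2
    # Split into quadrants; the eight recursive products realize the
    # semiring recursion C_ij = A_i1 * B_1j + A_i2 * B_2j.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = co_matmul(A11, B11, threshold) + co_matmul(A12, B21, threshold)
    C[:h, h:] = co_matmul(A11, B12, threshold) + co_matmul(A12, B22, threshold)
    C[h:, :h] = co_matmul(A21, B11, threshold) + co_matmul(A22, B21, threshold)
    C[h:, h:] = co_matmul(A21, B12, threshold) + co_matmul(A22, B22, threshold)
    return C
```

In a parallel runtime (e.g., Cilk-style work stealing), the four quadrant computations would be spawned as tasks; the time/space trade-off the abstract refers to arises from whether the two products contributing to each quadrant run in series (optimal space, longer span) or in parallel into temporary buffers (shorter span, extra space).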