ABSTRACT
Classic cache-oblivious parallel matrix multiplication algorithms achieve optimality in either time or space, but not both, which has motivated extensive research on the best achievable trade-off between the two. We study modern processor-oblivious runtime systems and identify several ways to improve an algorithm's time complexity while keeping its space and cache requirements asymptotically optimal. Building on this study, we present sub-linear-time algorithms with optimal work, space, and cache complexity, both for general matrix multiplication over a semiring and for Strassen-like fast algorithms over a ring. Our experiments show that these algorithms have empirical advantages over their classic counterparts. Our study offers new insights and research directions for optimizing cache-oblivious parallel algorithms from both theoretical and empirical perspectives.
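To make the setting concrete, the following is a minimal sketch of the classic cache-oblivious divide-and-conquer scheme for matrix multiplication over a semiring, which the paper's algorithms refine. It is not the paper's algorithm: it is the textbook eight-way recursion, written sequentially, assuming square power-of-two matrices for brevity. The `threshold` base-case size is a hypothetical tuning parameter; a real implementation would call an optimized kernel there.

```python
import numpy as np

def co_matmul(A, B, threshold=64):
    """Cache-oblivious divide-and-conquer matrix multiply (illustrative sketch).

    Recursively halves the problem so that subproblems eventually fit in
    every level of the cache hierarchy without knowing cache parameters.
    Assumes square matrices with power-of-two dimension.
    """
    n = A.shape[0]
    if n <= threshold:
        return A @ B  # base-case kernel (here: library multiply)
    h = n // 2
    # Split into quadrants; the eight recursive products realize the
    # semiring recursion C_ij = A_i1 * B_1j + A_i2 * B_2j.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = co_matmul(A11, B11, threshold) + co_matmul(A12, B21, threshold)
    C[:h, h:] = co_matmul(A11, B12, threshold) + co_matmul(A12, B22, threshold)
    C[h:, :h] = co_matmul(A21, B11, threshold) + co_matmul(A22, B21, threshold)
    C[h:, h:] = co_matmul(A21, B12, threshold) + co_matmul(A22, B22, threshold)
    return C
```

In a parallel runtime (e.g., Cilk-style work stealing), the four quadrant computations would be spawned as tasks; the time/space trade-off the abstract refers to arises from whether the two products contributing to each quadrant run in series (optimal space, longer span) or in parallel into temporary buffers (shorter span, extra space).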