DOI: 10.1145/3409390.3409404

Improving the Space-Time Efficiency of Matrix Multiplication Algorithms

Published: 17 August 2020

ABSTRACT

Classic cache-oblivious parallel matrix multiplication algorithms achieve optimality in either time or space, but not both, which has motivated much research on the best possible balance or trade-off between the two. We study modern processor-oblivious runtime systems and identify several ways to improve an algorithm's time complexity while keeping its space and cache requirements asymptotically optimal. Based on this study, we present sub-linear-time algorithms with optimal work, space, and cache complexity for both general matrix multiplication on a semiring and Strassen-like fast matrix multiplication on a ring. Our experiments show that these algorithms have empirical advantages over their classic counterparts. Our study provides new insights and research directions for optimizing cache-oblivious parallel algorithms from both theoretical and empirical perspectives.
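
To make the setting concrete, the sketch below (our illustration, not the paper's algorithm) shows the classic cache-oblivious divide-and-conquer scheme for computing C += A * B on a semiring, which is the baseline the abstract refers to: each matrix is split into quadrants and the eight sub-products are computed recursively. In a processor-oblivious runtime such as Cilk, the four independent sub-products in each half would be spawned in parallel; written serially here, it still exhibits the cache-oblivious recursion structure. The power-of-two matrix size and the base-case cutoff of 32 are arbitrary illustrative choices.

// A minimal sketch of classic cache-oblivious divide-and-conquer matrix
// multiplication: C += A * B, all matrices n x n (n a power of two),
// stored row-major with leading dimension ld.
#include <cstddef>
#include <vector>
#include <iostream>

static void mm_rec(const double* A, const double* B, double* C,
                   std::size_t n, std::size_t ld) {
    if (n <= 32) {                      // small base case: ordinary triple loop
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    std::size_t h = n / 2;              // split each matrix into quadrants
    const double* A11 = A;              const double* A12 = A + h;
    const double* A21 = A + h * ld;     const double* A22 = A + h * ld + h;
    const double* B11 = B;              const double* B12 = B + h;
    const double* B21 = B + h * ld;     const double* B22 = B + h * ld + h;
    double* C11 = C;                    double* C12 = C + h;
    double* C21 = C + h * ld;           double* C22 = C + h * ld + h;
    // First half: C_ij += A_i1 * B_1j   (four independent sub-products;
    // a parallel runtime would spawn these)
    mm_rec(A11, B11, C11, h, ld);  mm_rec(A11, B12, C12, h, ld);
    mm_rec(A21, B11, C21, h, ld);  mm_rec(A21, B12, C22, h, ld);
    // Second half: C_ij += A_i2 * B_2j  (depends on the first half's writes to C)
    mm_rec(A12, B21, C11, h, ld);  mm_rec(A12, B22, C12, h, ld);
    mm_rec(A22, B21, C21, h, ld);  mm_rec(A22, B22, C22, h, ld);
}

int main() {
    const std::size_t n = 128;          // power of two for simplicity
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    mm_rec(A.data(), B.data(), C.data(), n, n);
    std::cout << "C[0][0] = " << C[0] << "\n";  // expect 1.0 * 2.0 * 128 = 256
    return 0;
}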


  • Published in

    ICPP Workshops '20: Workshop Proceedings of the 49th International Conference on Parallel Processing
    August 2020
    186 pages
    ISBN: 9781450388689
    DOI: 10.1145/3409390

    Copyright © 2020 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 17 August 2020


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 91 of 313 submissions, 29%
