
Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

International Journal of Parallel Programming

An Erratum to this article was published on 23 November 2015

Abstract

Many-core systems are primarily designed for applications with large data parallelism. We propose an efficient hybrid matrix multiplication implementation based on the Strassen and Winograd algorithms (S-MM and W-MM) for many-core systems. A depth-first (DFS) traversal of the recursion tree is used, in which all cores work in parallel on each of the \(N \times N\) sub-matrices, which are computed in sequence. DFS reduces storage at the cost of increased data motion to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic algebra functions to reduce overhead, (2) invoking an efficient library (CUBLAS 5.5) for the basic functions, and (3) parameter tuning of the parametric kernels to improve resource occupancy. S-MM and W-MM are evaluated on a GPU and on the MIC (Xeon Phi). On the GPU, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for arrays satisfying \(N \ge 2048\) and \(N \ge 3072\), respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to the NVIDIA SDK library, S-MM and W-MM achieve speedups between 20\(\times\) and 80\(\times\) for the above arrays. On the MIC, two-recursion S-MM with reordering is faster than the MKL library by 14–26 % for \(N \ge 1024\). The proposed implementations achieve 2.35 TFLOPS (67 % of peak) on the GPU and 0.5 TFLOPS (21 % of peak) on the MIC. Similarly encouraging results are obtained on a 16-core Xeon-E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries.
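To make the recursion cutoff concrete, the following sketch (CUDA C++ with cuBLAS, not taken from the paper) shows how a single Strassen recursion level can delegate the seven half-size products to cublasSgemm and the quadrant additions to cublasSgeam, corresponding to the one-level S-MM configuration discussed above. The column-major device-memory layout, the helper quad, the function name strassen_level1, and the omission of error checking are all illustrative assumptions; only two of the seven products and part of the C11 aggregation are written out.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Pointer to quadrant (i, j) of an N x N column-major matrix with leading dimension N.
// (Illustrative helper, not from the paper.)
static float* quad(float* M, int N, int i, int j) {
    int n = N / 2;
    return M + (size_t)j * n * N + (size_t)i * n;
}
static const float* quad(const float* M, int N, int i, int j) {
    int n = N / 2;
    return M + (size_t)j * n * N + (size_t)i * n;
}

// C = A * B for N x N single-precision matrices in device memory (column-major),
// using one Strassen recursion level; the seven half-size products go to cublasSgemm.
void strassen_level1(cublasHandle_t h, int N, const float* A, const float* B, float* C) {
    const int n = N / 2;
    const float one = 1.0f, zero = 0.0f, neg = -1.0f;
    const size_t bytes = sizeof(float) * (size_t)n * n;

    float *T1, *T2, *M[7];                      // operand and product temporaries
    cudaMalloc((void**)&T1, bytes);
    cudaMalloc((void**)&T2, bytes);
    for (int k = 0; k < 7; ++k) cudaMalloc((void**)&M[k], bytes);

    // M1 = (A11 + A22)(B11 + B22)
    cublasSgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, quad(A, N, 0, 0), N, &one, quad(A, N, 1, 1), N, T1, n);
    cublasSgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, quad(B, N, 0, 0), N, &one, quad(B, N, 1, 1), N, T2, n);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, T1, n, T2, n, &zero, M[0], n);

    // M4 = A22 (B21 - B11)
    cublasSgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, quad(B, N, 1, 0), N, &neg, quad(B, N, 0, 0), N, T2, n);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, quad(A, N, 1, 1), N, T2, n, &zero, M[3], n);

    // M2, M3, M5, M6, M7 are formed the same way (at most two geam calls and one gemm each).

    // Gather step of the DFS scheme, e.g. C11 = M1 + M4 - M5 + M7:
    cublasSgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, M[0], n, &one, M[3], n, quad(C, N, 0, 0), N);
    // ... then C11 -= M5 and C11 += M7 with further in-place geam calls,
    // and C12, C21, C22 are aggregated analogously.

    cudaFree(T1); cudaFree(T2);
    for (int k = 0; k < 7; ++k) cudaFree(M[k]);
}

With more recursion levels, the same decomposition would be applied to each half-size product before falling back to the library GEMM; the cutoff is fixed at one level here only to keep the sketch short.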





Author information


Corresponding author

Correspondence to Ayaz ul Hassan Khan.


About this article


Cite this article

Khan, A.u.H., Al-Mouhamed, M., Fatayer, A. et al. Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core. Int J Parallel Prog 44, 801–830 (2016). https://doi.org/10.1007/s10766-015-0378-1

