Abstract
We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen's method, we represent the matrix multiplication work as a set of addition and multiplication tasks over sub-matrices. We then distribute the tasks to CPUs and GPUs while considering the characteristics of the tasks and computing resources, to minimize data communication overhead and fully utilize the available computing power. To handle a large matrix efficiently with limited GPU memory, we also propose a block-based work decomposition method. We further improve performance by exploiting the concurrent execution abilities of a heterogeneous parallel computing system. We implemented our method on five different heterogeneous systems and applied it to matrices of various sizes. Our method generally shows higher performance than prior GPU-based matrix multiplication methods. Moreover, compared with the state-of-the-art GPU matrix multiplication library (i.e., CUBLAS), our method achieved up to 1.97 times higher performance using the same GPUs and CPU cores. In some cases, our method using a low-performance GPU (e.g., GTX 1060, 3 GB) achieved performance comparable to that of CUBLAS using a high-performance GPU (e.g., RTX 2080, 8 GB). Also, our method continually improves performance as we use more computing resources, such as additional CPU cores and GPUs. We could achieve such high performance because our approach fully utilized the capacities of the given heterogeneous parallel computing systems while employing Strassen's method, which has a lower asymptotic complexity. These results demonstrate the efficiency and robustness of our algorithm.
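The task decomposition the abstract describes starts from one level of Strassen's method: each input matrix is split into four sub-matrices, and the product is formed from seven sub-matrix multiplications (plus additions) instead of eight. A minimal NumPy sketch of that decomposition follows; it illustrates only the classical Strassen formulas, not the paper's actual CPU/GPU scheduling or block-based memory management.

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's method for n x n matrices (n even):
    split A and B into four n/2 x n/2 blocks and form C = A @ B
    from seven block products M1..M7 instead of eight."""
    n = A.shape[0]
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven independent multiplication tasks; in a heterogeneous
    # setting each could be dispatched to a CPU core or a GPU.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombine the block results using additions only.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Applied recursively, this decomposition yields the O(n^2.807) asymptotic complexity that the abstract contrasts with the O(n^3) cost of the standard algorithm.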
Notes
Available at https://sites.google.com/view/hpclab/ip/datasets.
References
AMD: Ryzen series. https://www.amd.com/en/products/embedded-ryzen-series (2020)
Abdelfattah A, Tomov S, Dongarra J (2019) Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs. In: 2019 IEEE international parallel and distributed processing symposium (IPDPS), IEEE, pp 111–122
Agarwal RC, Balle SM, Gustavson FG, Joshi M, Palkar P (1995) A three-dimensional approach to parallel matrix multiplication. IBM J Res Dev 39(5):575–582
Ballard G, Demmel J, Holtz O, Lipshitz B, Schwartz O (2012) Communication-optimal parallel algorithm for Strassen’s matrix multiplication. In: Proceedings of the twenty-fourth annual ACM symposium on parallelism in algorithms and architectures, ACM, pp 193–204
Barrachina S, Castillo M, Igual FD, Mayo R, Quintana-Orti ES (2008) Evaluation and tuning of the level 3 CUBLAS for graphics processors. In: 2008 IEEE international symposium on parallel and distributed processing, IEEE, pp 1–8
Chtchelkanova A, Gunnels J, Morrow G, Overfelt J, Van De Geijn RA (1997) Parallel implementation of BLAS: general techniques for level 3 BLAS. Concurr Pract Exp 9(9):837–857
Coppersmith D, Winograd S (1987) Matrix multiplication via arithmetic progressions. In: Proceedings of the nineteenth annual ACM symposium on theory of computing, ACM, pp 1–6
Dagum L, Menon R (1998) OpenMP: an industry standard API for shared-memory programming. IEEE Comput Sci Eng 5:46–55
Fatahalian K, Sugerman J, Hanrahan P (2004) Understanding the efficiency of GPU algorithms for matrix–matrix multiplication. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on graphics hardware, ACM, pp 133–137
Irony D, Toledo S, Tiskin A (2004) Communication lower bounds for distributed-memory matrix multiplication. J Parallel Distrib Comput 64(9):1017–1026
Itu L, Moldoveanu F, Suciu C, Postelnicu A (2012) GPU enhanced stream-based matrix multiplication. Bull Transil Univ Brasov Eng Sci Ser I 5(2):79
Karunadasa N, Ranasinghe D (2009) On the comparative performance of parallel algorithms on small GPU/CUDA clusters. In: International conference on high performance computing
Kelefouras V, Kritikakou A, Goutis C (2014) A matrix–matrix multiplication methodology for single/multi-core architectures using SIMD. J Supercomput 68(3):1418–1440
Kim D, Heo JP, Huh J, Kim J, Yoon SE (2009) HPCCD: Hybrid parallel continuous collision detection. Comput Graph Forum (Pac Graph) 28(7)
Kim D, Lee J, Lee J, Shin I, Kim J, Yoon SE (2013) Scheduling in heterogeneous computing environments for proximity queries. IEEE Trans Vis Comput Graph 19(9):1513–1525
Lai PW, Arafat H, Elango V, Sadayappan P (2013) Accelerating Strassen–Winograd’s matrix multiplication algorithm on GPUs. In: 20th international conference on high performance computing (HiPC), IEEE, pp 139–148
Li A, Serban R, Negrut D (2014) An overview of NVIDIA Tegra K1 architecture. http://sbel.wisc.edu/documents/TR-2014-17
Lipshitz B, Ballard G, Demmel J, Schwartz O (2012) Communication-avoiding parallel Strassen: implementation and performance. In: Proceedings of the international conference on high performance computing, networking, storage and analysis, p 101
Liu W, Vinter B (2014) An efficient GPU general sparse matrix-matrix multiplication for irregular data. In: IEEE 28th international parallel and distributed processing symposium, IEEE, pp 370–381
NVIDIA: CUBLAS libraries. https://developer.nvidia.com/cublas (2018)
NVIDIA: CUDA programming guide 9.2 (2018)
Ohtaki Y, Takahashi D, Boku T, Sato M (2004) Parallel implementation of Strassen's matrix multiplication algorithm for heterogeneous clusters. In: 18th international parallel and distributed processing symposium, proceedings, IEEE, p 112
Pan VY (1979) Field extension and trilinear aggregating, uniting and canceling for the acceleration of matrix multiplications. In: IEEE foundations of computer science, IEEE, pp 28–38
Pan VY (1980) New fast algorithms for matrix operations. SIAM J Comput 9(2):321–342
Ray U, Hazra TK, Ray UK (2016) Matrix multiplication using Strassen’s algorithm on CPU & GPU
Ryu S, Kim D (2018) Parallel huge matrix multiplication on a cluster with GPGPU accelerators. In: 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW), IEEE, pp 877–882
Saravanan V, Radhakrishnan M, Basavesh A, Kothari D (2012) A comparative study on performance benefits of multi-core CPUs using OpenMP. Int J Comput Sci Issues (IJCSI) 9(1):272
Schönhage A (1981) Partial and total matrix multiplication. SIAM J Comput 10(3):434–455
Strassen V (1969) Gaussian elimination is not optimal. Numerische Mathematik 13(4):354–356
Strassen V (1986) The asymptotic spectrum of tensors and the exponent of matrix multiplication. In: 27th annual symposium on foundations of computer science, IEEE, pp 49–54
Sun Y, Tong Y (2010) CUDA based fast implementation of very large matrix computation. In: 2010 international conference on parallel and distributed computing, applications and technologies, pp 487–491. https://doi.org/10.1109/PDCAT.2010.45
Sunitha N, Raju K, Chiplunkar NN (2017) Performance improvement of CUDA applications by reducing CPU-GPU data transfer overhead. In: International conference on inventive communication and computational technologies (ICICCT), pp 211–215
Volkov V, Demmel JW (2008) Benchmarking GPUs to tune dense linear algebra. In: International conference for high performance computing, networking, storage and analysis, IEEE, pp 1–11
Yugopuspito P, Sutrisno, Hudi R (2013) Breaking through memory limitation in GPU parallel processing using Strassen algorithm. In: International conference on computer, control, informatics and its applications (IC3INA), pp 201–205
Zhang P, Gao Y (2015) Matrix multiplication on high-density multi-GPU architectures: theoretical and experimental investigations. In: International conference on high performance computing, Springer, pp 17–30
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No.2018R1C1B5045551).
Cite this article
Kang, H., Kwon, H.C. & Kim, D. HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs. Computing 102, 2607–2631 (2020). https://doi.org/10.1007/s00607-020-00846-1