Optimized Hybrid Execution of Dense Matrix-Matrix Multiplication on Clusters of Heterogeneous Multicore and Many-Core Platforms

Barlas, Gerassimos

doi:10.1007/978-3-030-86359-3_14

Gerassimos Barlas ORCID: orcid.org/0000-0002-9792-9638

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12942)

Included in the following conference series:

International Conference on Parallel Computing Technologies

Abstract

In this paper we analytically solve the partitioning problem for dense matrix-matrix multiplication on a cluster of heterogeneous multicore machines equipped with a variety of accelerators. We provide closed-form solutions that yield an optimal partitioning in time linear in the number of cores in the system.
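To give a feel for why such a partitioning can be computed in linear time, here is a minimal sketch under a deliberately simplified, hypothetical cost model. It is not the paper's model, which also has to account for communication and accelerators. Assume worker i needs b_i + a_i * x seconds to compute x rows of C. Requiring every worker to finish at the same instant T, the standard DLT optimality condition, gives x_i = (T - b_i) / a_i with T = (N + sum(b_i / a_i)) / sum(1 / a_i), which is a single O(P) pass over P workers. The names `equal_finish_partition` and `to_integer_rows` are illustrative only.

```python
from typing import List, Sequence


def equal_finish_partition(n_rows: int, a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Split n_rows of C so that all workers finish at the same time T.

    Hypothetical cost model: worker i needs b[i] + a[i] * x seconds for x rows,
    where a[i] > 0 is the time per row and b[i] >= 0 is a fixed overhead.
    """
    if not a or len(a) != len(b):
        raise ValueError("a and b must be non-empty and of equal length")
    inv_sum = sum(1.0 / ai for ai in a)
    # Common finish time T solves sum_i (T - b[i]) / a[i] == n_rows.
    t_finish = (n_rows + sum(bi / ai for ai, bi in zip(a, b))) / inv_sum
    parts = [(t_finish - bi) / ai for ai, bi in zip(a, b)]
    if any(p < 0 for p in parts):
        # A worker whose overhead exceeds T should be dropped and the system re-solved.
        raise ValueError("overhead too large for some worker; drop it and re-solve")
    return parts


def to_integer_rows(parts: Sequence[float], n_rows: int) -> List[int]:
    """Round fractional shares to whole rows (largest-remainder rule)."""
    floors = [int(p) for p in parts]  # parts are non-negative, so int() floors
    remainder = n_rows - sum(floors)
    order = sorted(range(len(parts)), key=lambda i: parts[i] - floors[i], reverse=True)
    for i in order[:remainder]:
        floors[i] += 1
    return floors
```

For example, three workers with per-row times of 1, 2 and 4 ms and no overhead receive 400, 200 and 100 of 700 rows, so each finishes after 400 ms.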

We also show that the system parameters required by Divisible Load Theory (DLT) can feasibly be calculated online, at run time. This allows DLT frameworks to be deployed easily, without a costly a priori benchmarking procedure.
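The abstract does not describe how the parameters are obtained online. One plausible approach, shown here purely as an assumption, is to time a few small probe chunks on each worker at run time and fit a linear cost t ≈ b + a * x by least squares. The names `time_probe`, `fit_linear_cost` and `estimate_online` are hypothetical, and `kernel` stands for whatever routine processes a given number of rows on that worker.

```python
import time
from typing import Callable, Sequence, Tuple


def time_probe(kernel: Callable[[int], object], rows: int) -> float:
    """Wall-clock seconds taken by kernel(rows) on this worker."""
    start = time.perf_counter()
    kernel(rows)
    return time.perf_counter() - start


def fit_linear_cost(rows: Sequence[float], seconds: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of seconds ~ b + a * rows; returns (a, b)."""
    n = len(rows)
    if n < 2 or n != len(seconds):
        raise ValueError("need at least two (rows, seconds) samples")
    mean_x = sum(rows) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in rows)
    if sxx == 0:
        raise ValueError("probe sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rows, seconds))
    a = sxy / sxx
    b = mean_y - a * mean_x
    # A fixed overhead cannot be negative; clamp noise-induced negatives to 0.
    return a, max(b, 0.0)


def estimate_online(kernel: Callable[[int], object],
                    probe_rows: Sequence[int] = (8, 16, 32)) -> Tuple[float, float]:
    """Time a few small probe chunks and fit this worker's (a, b)."""
    seconds = [time_probe(kernel, r) for r in probe_rows]
    return fit_linear_cost(probe_rows, seconds)
```

Two probes of 16 and 64 rows taking 0.021 s and 0.069 s, for instance, give a ≈ 0.001 s per row and b ≈ 0.005 s. The resulting per-worker estimates can then feed a partitioning step in place of offline benchmark figures.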

The paper concludes with an extensive experimental study showing that our DLT framework, coupled with online parameter calculation, can outperform dynamic partitioning while leveraging existing optimized Dense Linear Algebra (DLA) libraries, such as NVIDIA’s cuBLAS and Intel’s MKL.
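Once row counts are fixed, each node's share of C = A·B can be computed as an independent GEMM on a horizontal slab of A and the whole of B, so the per-node work can be delegated to an optimized library. The sketch below assumes such a row-wise split (the abstract does not state the paper's exact decomposition) and uses a naive `local_gemm` as a stand-in for a call into MKL or cuBLAS; `partitioned_matmul` and `local_gemm` are hypothetical names.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def local_gemm(a_rows: Matrix, b: Matrix) -> Matrix:
    """Naive stand-in for an optimized GEMM (e.g. MKL or cuBLAS on a real node)."""
    k = len(b)
    m = len(b[0]) if b else 0
    out: Matrix = []
    for row in a_rows:
        if len(row) != k:
            raise ValueError("inner dimensions of A and B differ")
        out.append([sum(row[t] * b[t][j] for t in range(k)) for j in range(m)])
    return out


def partitioned_matmul(a: Matrix, b: Matrix, row_counts: Sequence[int]) -> Matrix:
    """Compute C = A * B by giving worker i the next row_counts[i] rows of A."""
    if sum(row_counts) != len(a):
        raise ValueError("row counts must cover every row of A exactly once")
    c: Matrix = []
    start = 0
    for count in row_counts:
        # A real worker would receive a[start:start + count] plus all of B,
        # run its local GEMM, and return the matching rows of C.
        c.extend(local_gemm(a[start:start + count], b))
        start += count
    return c
```

Because the slabs are disjoint and together cover A, concatenating the partial results reproduces the full product regardless of how the rows were split.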

Author information

Authors and Affiliations

Computer Science and Engineering Department, College of Engineering, American University of Sharjah, POB 26666, Sharjah, UAE
Gerassimos Barlas

Corresponding author

Correspondence to Gerassimos Barlas.

Editor information

Editors and Affiliations

Institute of Computational Mathematics and Mathematical Geophysics SB RAS, Novosibirsk, Russia
Victor Malyshkin

About this paper

Cite this paper

Barlas, G. (2021). Optimized Hybrid Execution of Dense Matrix-Matrix Multiplication on Clusters of Heterogeneous Multicore and Many-Core Platforms. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2021. Lecture Notes in Computer Science, vol 12942. Springer, Cham. https://doi.org/10.1007/978-3-030-86359-3_14

DOI: https://doi.org/10.1007/978-3-030-86359-3_14
Published: 07 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86358-6
Online ISBN: 978-3-030-86359-3
eBook Packages: Computer Science, Computer Science (R0)