High Performance Tensor–Vector Multiplication on Shared-Memory Systems

Pawłowski, Filip; Uçar, Bora; Yzelman, Albert-Jan

doi:10.1007/978-3-030-43229-4_4

Filip Pawłowski,
Bora Uçar (ORCID: orcid.org/0000-0002-4960-3545) &
Albert-Jan Yzelman

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12043)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

Tensor–vector multiplication is one of the core components in tensor computations. We have recently investigated a high-performance, single-core implementation of this bandwidth-bound operation. Here, we investigate its efficient shared-memory implementations. Upon carefully analyzing the design space, we implement a number of alternatives using OpenMP and compare them experimentally. Experimental results on systems with up to 8 sockets show near-peak performance for the proposed algorithms.

Author information

Authors and Affiliations

Huawei Technologies France, 20 Quai du Point du Jour, 92100, Boulogne-Billancourt, France
Filip Pawłowski & Albert-Jan Yzelman
ENS Lyon, Lyon, France
Filip Pawłowski & Bora Uçar
CNRS and LIP (UMR5668, CNRS - ENS Lyon - UCB Lyon 1 - INRIA), Lyon, France
Bora Uçar

Corresponding author

Correspondence to Filip Pawłowski.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Southern California, Marina del Rey, CA, USA
Ewa Deelman
University of Tennessee, Knoxville, TN, USA
Jack Dongarra
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

About this paper

Cite this paper

Pawłowski, F., Uçar, B., Yzelman, AJ. (2020). High Performance Tensor–Vector Multiplication on Shared-Memory Systems. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_4

DOI: https://doi.org/10.1007/978-3-030-43229-4_4
Published: 19 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43228-7
Online ISBN: 978-3-030-43229-4
eBook Packages: Computer Science, Computer Science (R0)
