DOI: 10.1145/2807591.2807671

An input-adaptive and in-place approach to dense tensor-times-matrix multiply

Published: 15 November 2015

Abstract

This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (Ttm) of arbitrary dimension. Whereas conventional implementations of Ttm rely on explicitly converting the input tensor operand into a matrix (so that a fast general matrix-matrix multiply, or Gemm, can be applied), our framework's strategy is to carry out the Ttm in-place, avoiding this copy. Because the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the Ttm's inputs. When compared to the widely used single-node Ttm implementations available in the Tensor Toolbox and the Cyclops Tensor Framework (Ctf), InTensLi's in-place and input-adaptive Ttm implementations achieve 4× and 13× speedups, respectively, showing Gemm-like performance on a variety of input sizes.

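To make the contrast in the abstract concrete, the sketch below shows the conventional matricize-then-Gemm route to a mode-n Ttm that InTensLi avoids: the tensor is explicitly unfolded into a matrix, multiplied by the operand matrix with a single Gemm call, and folded back. This is only an illustrative NumPy sketch, not InTensLi's code; the function name `ttm_unfold`, its argument order, and the particular unfolding convention are assumptions made for the example.

```python
import numpy as np

def ttm_unfold(X, U, n):
    """Conventional mode-n tensor-times-matrix product (illustrative only).

    X : dense tensor of shape (I_0, ..., I_{N-1})
    U : matrix of shape (J, I_n)
    Returns a tensor whose mode-n dimension I_n is replaced by J.
    """
    # Explicit matricization: move mode n to the front and flatten the rest.
    # This generally makes a copy of the tensor, which is the cost InTensLi avoids.
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    Yn = U @ Xn  # the single Gemm call
    # Fold the result back: the first axis is now J, the other modes keep their order.
    new_shape = (U.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != n)
    return np.moveaxis(Yn.reshape(new_shape), 0, n)

# Tiny usage example on a 3-way tensor.
X = np.random.rand(4, 5, 6)
U = np.random.rand(7, 5)   # multiplies mode 1 (size 5 -> 7)
Y = ttm_unfold(X, U, 1)
print(Y.shape)             # (4, 7, 6)
```

The point of the paper's in-place approach is to avoid the explicit copy implied by the unfolding step above while still obtaining Gemm-like performance.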


Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015, 985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
  • General Chair: Jackie Kern
  • Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code generation
  2. multilinear algebra
  3. offline autotuning
  4. tensor operation

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

  • SC '15 Paper Acceptance Rate: 79 of 358 submissions, 22%
  • Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 23
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 28 Feb 2025


Cited By

  • (2024) A Tensor Compiler with Automatic Data Packing for Simple and Efficient Fully Homomorphic Encryption. Proceedings of the ACM on Programming Languages, 8(PLDI), 126-150. DOI: 10.1145/3656382. Online publication date: 20-Jun-2024.
  • (2024) autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41406.2024.00027. Online publication date: 17-Nov-2024.
  • (2024) Tensor tucker decomposition accelerated on FPGA for convolution layer compress. 2024 4th International Conference on Electronics, Circuits and Information Engineering (ECIE), 536-542. DOI: 10.1109/ECIE61885.2024.10626820. Online publication date: 24-May-2024.
  • (2024) Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS. Computational Science – ICCS 2024, 256-271. DOI: 10.1007/978-3-031-63749-0_18. Online publication date: 28-Jun-2024.
  • (2023) Static and Streaming Tucker Decomposition for Dense Tensors. ACM Transactions on Knowledge Discovery from Data, 17(5), 1-34. DOI: 10.1145/3568682. Online publication date: 27-Feb-2023.
  • (2022) Flexible Performant GEMM Kernels on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(9), 2230-2248. DOI: 10.1109/TPDS.2021.3136457. Online publication date: 1-Sep-2022.
  • (2022) An integrated learning and approximation scheme for coding of static or dynamic light fields based on hybrid Tucker–Karhunen–Loève transform-singular value decomposition via tensor double sketching. IET Signal Processing, 16(6), 680-694. DOI: 10.1049/sil2.12141. Online publication date: 29-Jun-2022.
  • (2022) a-Tucker: fast input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. CCF Transactions on High Performance Computing, 5(1), 12-25. DOI: 10.1007/s42514-022-00119-7. Online publication date: 11-Aug-2022.
  • (2022) A μ-mode BLAS approach for multidimensional tensor-structured problems. Numerical Algorithms, 92(4), 2483-2508. DOI: 10.1007/s11075-022-01399-4. Online publication date: 4-Oct-2022.
  • (2021) Parallel Tucker Decomposition with Numerically Accurate SVD. Proceedings of the 50th International Conference on Parallel Processing, 1-11. DOI: 10.1145/3472456.3472472. Online publication date: 9-Aug-2021.
