DOI: 10.1145/2807591.2807671

An input-adaptive and in-place approach to dense tensor-times-matrix multiply

Published: 15 November 2015

Abstract

This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (Ttm) of arbitrary dimension. Whereas conventional implementations of Ttm rely on explicitly converting the input tensor operand into a matrix (so that a fast general matrix-matrix multiply, or Gemm, can be applied), our framework's strategy is to carry out the Ttm in-place, avoiding this copy. Because the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the Ttm's inputs. When compared to the widely used single-node Ttm implementations available in the Tensor Toolbox and the Cyclops Tensor Framework (Ctf), InTensLi's in-place and input-adaptive Ttm implementations achieve 4× and 13× speedups, respectively, showing Gemm-like performance on a variety of input sizes.

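To make the contrast in the abstract concrete, the sketch below shows the conventional matricize-then-Gemm route to a mode-n Ttm that InTensLi avoids: the tensor is explicitly unfolded into a matrix, multiplied by the operand matrix with a single Gemm call, and folded back. This is only an illustrative NumPy sketch, not InTensLi's code; the function name `ttm_unfold`, its argument order, and the particular unfolding convention are assumptions made for the example.

```python
import numpy as np

def ttm_unfold(X, U, n):
    """Conventional mode-n tensor-times-matrix product (illustrative only).

    X : dense tensor of shape (I_0, ..., I_{N-1})
    U : matrix of shape (J, I_n)
    Returns a tensor whose mode-n dimension I_n is replaced by J.
    """
    # Explicit matricization: move mode n to the front and flatten the rest.
    # This generally makes a copy of the tensor, which is the cost InTensLi avoids.
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    Yn = U @ Xn  # the single Gemm call
    # Fold the result back: the first axis is now J, the other modes keep their order.
    new_shape = (U.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != n)
    return np.moveaxis(Yn.reshape(new_shape), 0, n)

# Tiny usage example on a 3-way tensor.
X = np.random.rand(4, 5, 6)
U = np.random.rand(7, 5)   # multiplies mode 1 (size 5 -> 7)
Y = ttm_unfold(X, U, 1)
print(Y.shape)             # (4, 7, 6)
```

The point of the paper's in-place approach is to avoid the explicit copy implied by the unfolding step above while still obtaining Gemm-like performance.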


Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015, 985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
  • General Chair: Jackie Kern
  • Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code generation
  2. multilinear algebra
  3. offline autotuning
  4. tensor operation

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

  • SC '15 Paper Acceptance Rate: 79 of 358 submissions, 22%
  • Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 23
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 28 Feb 2025


Cited By

  • (2024) A Tensor Compiler with Automatic Data Packing for Simple and Efficient Fully Homomorphic Encryption. Proceedings of the ACM on Programming Languages, 8(PLDI), 126-150. DOI: 10.1145/3656382. Online publication date: 20-Jun-2024.
  • (2024) autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41406.2024.00027. Online publication date: 17-Nov-2024.
  • (2024) Tensor tucker decomposition accelerated on FPGA for convolution layer compress. 2024 4th International Conference on Electronics, Circuits and Information Engineering (ECIE), 536-542. DOI: 10.1109/ECIE61885.2024.10626820. Online publication date: 24-May-2024.
  • (2024) Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS. Computational Science – ICCS 2024, 256-271. DOI: 10.1007/978-3-031-63749-0_18. Online publication date: 28-Jun-2024.
  • (2023) Static and Streaming Tucker Decomposition for Dense Tensors. ACM Transactions on Knowledge Discovery from Data, 17(5), 1-34. DOI: 10.1145/3568682. Online publication date: 27-Feb-2023.
  • (2022) Flexible Performant GEMM Kernels on GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(9), 2230-2248. DOI: 10.1109/TPDS.2021.3136457. Online publication date: 1-Sep-2022.
  • (2022) An integrated learning and approximation scheme for coding of static or dynamic light fields based on hybrid Tucker–Karhunen–Loève transform-singular value decomposition via tensor double sketching. IET Signal Processing, 16(6), 680-694. DOI: 10.1049/sil2.12141. Online publication date: 29-Jun-2022.
  • (2022) a-Tucker: fast input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. CCF Transactions on High Performance Computing, 5(1), 12-25. DOI: 10.1007/s42514-022-00119-7. Online publication date: 11-Aug-2022.
  • (2022) A μ-mode BLAS approach for multidimensional tensor-structured problems. Numerical Algorithms, 92(4), 2483-2508. DOI: 10.1007/s11075-022-01399-4. Online publication date: 4-Oct-2022.
  • (2021) Parallel Tucker Decomposition with Numerically Accurate SVD. Proceedings of the 50th International Conference on Parallel Processing, 1-11. DOI: 10.1145/3472456.3472472. Online publication date: 9-Aug-2021.
