ABSTRACT
This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (Ttm) of arbitrary dimension. Whereas conventional implementations of Ttm rely on explicitly converting the input tensor operand into a matrix---in order to use any available fast general matrix-matrix multiply (Gemm) implementation---our framework's strategy is to carry out the Ttm in-place, avoiding this copy. Because the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the Ttm's inputs. Compared to the widely used single-node Ttm implementations available in the Tensor Toolbox and the Cyclops Tensor Framework (Ctf), InTensLi's in-place and input-adaptive Ttm implementations achieve 4× and 13× speedups, respectively, showing Gemm-like performance on a variety of input sizes.
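The distinction between the conventional unfolding-based approach and a direct, copy-avoiding contraction can be illustrated with a small NumPy sketch. This is a hypothetical illustration, not the paper's implementation: `np.tensordot` still performs internal copies, so it only approximates a truly in-place kernel such as InTensLi's.

```python
import numpy as np

def ttm_unfold(X, U, n):
    """Mode-n TTM the conventional way: matricize X along mode n
    (an explicit copy), apply one large GEMM, then fold back."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)  # mode-n unfolding
    Yn = U @ Xn                                        # single GEMM
    new_shape = (U.shape[0],) + tuple(
        s for i, s in enumerate(X.shape) if i != n)
    return np.moveaxis(Yn.reshape(new_shape), 0, n)

def ttm_direct(X, U, n):
    """Mode-n TTM without forming the unfolded matrix explicitly:
    contract mode n of X against the columns of U directly."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

# Multiply a 4 x 5 x 6 tensor by a 3 x 5 matrix along mode 1.
X = np.random.rand(4, 5, 6)
U = np.random.rand(3, 5)
Y = ttm_direct(X, U, 1)       # shape (4, 3, 6)
assert np.allclose(Y, ttm_unfold(X, U, 1))
```

Both routines compute the same result; the difference the paper targets is the cost of the explicit unfolding copy, which the in-place strategy avoids.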
- An updated set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw., 28(2):135--151, June 2002.
- E. Acar, C. Aykut-Bingol, H. Bingol, R. Bro, and B. Yener. Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10--i18, 2007.
- E. Acar, S. A. Camtepe, M. S. Krishnamoorthy, and B. Yener. Modeling and multiway analysis of chatroom tensors. In Intelligence and Security Informatics, pages 256--268. Springer, 2005.
- E. Acar, R. J. Harrison, F. Olken, O. Alter, M. Helal, L. Omberg, B. Bader, A. Kennedy, H. Park, Z. Bai, D. Kim, R. Plemmons, G. Beylkin, T. Kolda, S. Ragnarsson, L. Delathauwer, J. Langou, S. P. Ponnapalli, I. Dhillon, L.-h. Lim, J. R. Ramanujam, C. Ding, M. Mahoney, J. Raynolds, L. Eldén, C. Martin, P. Regalia, P. Drineas, M. Mohlenkamp, C. Faloutsos, J. Morton, B. Savas, S. Friedland, L. Mullin, and C. Van Loan. Future directions in tensor-based computation and modeling, 2009.
- A. A. Auer et al. Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Molecular Physics, 104(2):211--228, 2006.
- B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.5. Available from http://www.sandia.gov/tgkolda/TensorToolbox/, January 2012.
- G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1--155, 2014.
- M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. Efficient and scalable computations with sparse tensors. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pages 1--6, Sept 2012.
- J. H. Choi and S. Vishwanathan. DFacTo: Distributed factorization of tensors. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1296--1304. Curran Associates, Inc., 2014.
- A. Cichocki. Era of big data processing: A new approach via tensor networks and tensor decompositions. CoRR, abs/1403.2048, 2014.
- K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34(3):12:1--12:25, May 2008.
- L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31(4):2029--2054, May 2010.
- L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36(1):53--78, 2013.
- R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis. 1970.
- J. C. Ho, J. Ghosh, and J. Sun. Marble: High-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 115--124, New York, NY, USA, 2014. ACM.
- R. W. Hockney and I. J. Curington. f1/2: A parameter to characterize memory and communication bottlenecks. Parallel Computing, 10:277--286, 1989.
- Intel. Math kernel library. http://developer.intel.com/software/products/mkl/.
- I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos. HaTen2: Billion-scale tensor decompositions. In ICDE, 2015.
- M. Jiang, P. Cui, F. Wang, X. Xu, W. Zhu, and S. Yang. FEMA: Flexible evolutionary multi-faceted analysis for dynamic behavioral pattern discovery. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1186--1195, New York, NY, USA, 2014. ACM.
- U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12--16, 2012, pages 316--324, 2012.
- T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455--500, 2009.
- T. G. Kolda and J. Sun. Scalable tensor decompositions for multi-aspect data mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 363--372, Washington, DC, USA, 2008. IEEE Computer Society.
- C.-F. V. Latchoumane, F.-B. Vialatte, J. Solé-Casals, M. Maurice, S. R. Wimalaratna, N. Hudson, J. Jeong, and A. Cichocki. Multiway array decomposition analysis of EEGs in Alzheimer's disease. Journal of Neuroscience Methods, 207(1):41--50, 2012.
- L. D. Lathauwer and J. Vandewalle. Dimensionality reduction in higher-order signal processing and rank-(r1, r2, ..., rn) reduction in multilinear algebra. Linear Algebra and its Applications, 391:31--55, 2004. Special Issue on Linear Algebra in Signal and Image Processing.
- J. Li, G. Tan, M. Chen, and N. Sun. SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 117--126, New York, NY, USA, 2013. ACM.
- Y. Matsubara, Y. Sakurai, W. G. van Panhuis, and C. Faloutsos. FUNNEL: Automatic mining of spatially coevolving epidemics. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 105--114, New York, NY, USA, 2014. ACM.
- J. Mocks. Topographic components model for event-related potentials and some biophysical considerations. Biomedical Engineering, IEEE Transactions on, 35(6):482--484, June 1988.
- M. Morup, L. K. Hansen, C. S. Herrmann, J. Parnas, and S. M. Arnfred. Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage, 29(3):938--947, 2006.
- J. Nagy and M. Kilmer. Kronecker product approximation for preconditioning in three-dimensional imaging applications. Image Processing, IEEE Transactions on, 15(3):604--613, March 2006.
- I. V. Oseledets. Tensor-train decomposition. SIAM J. Scientific Computing, 33(5):2295--2317, 2011.
- E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos. ParCube: Sparse parallelizable tensor decompositions. In Proceedings of the 2012 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 521--536, Bristol, United Kingdom, 2012.
- A. Ramanathan, P. K. Agarwal, M. Kurnikova, and C. J. Langmead. An online approach for mining collective behaviors from molecular dynamics simulations, volume LNCS 5541, pages 138--154. 2009.
- N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis. Memory-efficient parallel computation of tensor and matrix products for big tensor decompositions. Proceedings of the Asilomar Conference on Signals, Systems, and Computers, 2014.
- B. Savas and L. Eldén. Handwritten digit classification using higher order singular value decomposition. Pattern Recognition, 40(3):993--1003, 2007.
- A. Shashua and A. Levin. Linear image coding for regression and classification using the tensor-rank principle. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-42--I-49, 2001.
- N. Sidiropoulos, R. Bro, and G. Giannakis. Parallel factor analysis in sensor array processing. Signal Processing, IEEE Transactions on, 48(8):2377--2388, Aug 2000.
- N. Sidiropoulos, G. Giannakis, and R. Bro. Blind PARAFAC receivers for DS-CDMA systems. Signal Processing, IEEE Transactions on, 48(3):810--823, Mar 2000.
- S. Smith, N. Ravindran, N. Sidiropoulos, and G. Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In Proceedings of the 29th IEEE International Parallel & Distributed Processing Symposium, IPDPS, 2015.
- E. Solomonik, J. Demmel, and T. Hoefler. Communication lower bounds for tensor contraction algorithms. Technical report, ETH Zürich, 2015.
- E. Solomonik, D. Matthews, J. Hammond, and J. Demmel. Cyclops tensor framework: reducing communication and eliminating load imbalance in massively parallel contractions. Technical Report UCB/EECS-2012-210, EECS Department, University of California, Berkeley, Nov 2012.
- J. Sun, D. Tao, and C. Faloutsos. Beyond streams and graphs: dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374--383. ACM, 2006.
- L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279--311, 1966.
- F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software, 2013.
- M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Computer Vision - ECCV 2002, pages 447--460. Springer, 2002.
- R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In SuperComputing 1998: High Performance Networking and Computing, 1998.