Mining billion-scale tensors: algorithms and discoveries

The VLDB Journal

Abstract

How can we analyze large-scale real-world data with various attributes? Much real-world data with multiple attributes (e.g., network traffic logs, web data, social networks, knowledge bases, and sensor streams) is represented as multi-dimensional arrays, called tensors. To analyze a tensor, tensor decompositions are widely used in data mining applications: detecting malicious attackers in network traffic logs (with source IP, destination IP, port number, timestamp), finding telemarketers in a phone call history (with sender, receiver, date), and identifying interesting concepts in a knowledge base (with subject, object, relation). However, current tensor decomposition methods do not scale to large, sparse real-world tensors with millions of rows, columns, and ‘fibers.’ In this paper, we propose HaTen2, a distributed method for large-scale tensor decompositions that runs on the MapReduce framework. The careful design and implementation of HaTen2 dramatically reduce the size of intermediate data and the number of jobs, yielding higher scalability than the state-of-the-art method. Thanks to HaTen2, we analyze big real-world sparse tensors that the current state of the art cannot handle, and discover hidden concepts.
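
To make the tensor representation concrete, the following is a minimal single-machine sketch, not the HaTen2 algorithm itself. It assumes a hypothetical phone-call log with (sender, receiver, date) attributes, stores only the nonzero entries in coordinate form, and evaluates one entry of a rank-R PARAFAC (CP) model, one common tensor decomposition. The record values and the helper names `build_sparse_tensor` and `cp_entry` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical records: (sender, receiver, date, number of calls).
calls = [
    ("alice", "bob", "2015-01-01", 3.0),
    ("alice", "carol", "2015-01-01", 1.0),
    ("dave", "bob", "2015-01-02", 2.0),
]


def build_sparse_tensor(records):
    """Map attribute values to integer indices and keep only nonzero entries."""
    index_maps = [dict(), dict(), dict()]  # one index map per mode
    entries = defaultdict(float)
    for *keys, value in records:
        # setdefault assigns the next free index the first time a key is seen.
        coord = tuple(m.setdefault(k, len(m)) for m, k in zip(index_maps, keys))
        entries[coord] += value
    shape = tuple(len(m) for m in index_maps)
    return dict(entries), shape, index_maps


def cp_entry(factors, coord):
    """Entry of a rank-R CP model: sum over r of A[i][r] * B[j][r] * C[k][r]."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        prod = 1.0
        for factor, idx in zip(factors, coord):
            prod *= factor[idx][r]
        total += prod
    return total


tensor, shape, index_maps = build_sparse_tensor(calls)
```

Only three of the eight cells of this 2 x 2 x 2 tensor are stored, which is the property that makes sparse storage attractive when each mode has millions of indices.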

Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (grants No. 2013R1A1A3005259 and No. 2013R1A1A1064409). The ICT at Seoul National University provides research facilities for this study.

Author information

Corresponding author

Correspondence to U. Kang.

Ethics declarations

Conflict of interest

There are no potential conflicts of interest.

Human and animal rights statement

The research does not involve human participants and/or animals.

About this article

Cite this article

Jeon, I., Papalexakis, E.E., Faloutsos, C. et al. Mining billion-scale tensors: algorithms and discoveries. The VLDB Journal 25, 519–544 (2016). https://doi.org/10.1007/s00778-016-0427-4

  • DOI: https://doi.org/10.1007/s00778-016-0427-4
