Abstract
This study deals with missing link prediction, the problem of predicting the existence of missing connections between entities of interest. We approach the problem as filling in missing entries in a relational dataset represented by several matrices and multiway arrays, which we simply call tensors. Accordingly, we address link prediction through data fusion, formulated as the simultaneous factorization of several observation tensors in which latent factors are shared among the observations. Previous studies on joint factorization of such heterogeneous datasets have focused on a single loss function (mainly squared Euclidean distance or Kullback–Leibler divergence) and specific tensor factorization models (CANDECOMP/PARAFAC and/or Tucker). In this paper, by contrast, we use the generalized coupled tensor factorization framework to study various alternative tensor models and loss functions, including those already studied in the literature. Through extensive experiments on two real-world datasets, we demonstrate that (i) joint analysis of data from multiple sources via coupled factorization significantly improves link prediction performance, (ii) selecting a suitable loss function and tensor factorization model is crucial for accurate missing link prediction, and loss functions that have not previously been studied for link prediction may outperform the commonly used ones, (iii) joint factorization of datasets can handle difficult cases, such as the cold start problem that arises when a new entity enters the dataset, and (iv) our approach is scalable to large-scale data.
Notes
Some of the listed studies do not impose nonnegativity constraints on the factor matrices while GCTF assumes that all factor matrices are nonnegative.
Acknowledgments
This work is funded by the TUBITAK Grant Number 110E292, Bayesian matrix and tensor factorisations (BAYTEN) and Boğaziçi University Research Fund BAP5723. It is also funded in part by the Danish Council for Independent Research—Technology and Production Sciences and Sapere Aude Program under the Projects 11-116328 and 11-120947.
Additional information
Responsible editor: Jian Pei.
Appendix
1.1 Computation for common factors
Here, we show the computation for \(A\):
and \(B\):
given in Model 1, Sect. 4.1.
1.2 Computational complexity
We have conducted experiments on the tensor completion problem to demonstrate that the time complexity of the modeling framework is \(O(N)\) for sparse datasets, where \(N\) is the number of known entries. We consider two situations in these experiments: (i) a \(500 \times 500 \times 500\) three-way array with 99 % missing data (1.25 million known values), and (ii) a \(1,000 \times 1,000 \times 1,000\) three-way array with 98 % missing data (20 million known values). We have used a CP tensor factorization model with \(R = 3\) components to generate the data and then added 20 % random Gaussian noise. We have then fitted a CP model using the EUC distance-based loss function and used the extracted CP factors to reconstruct the data. Figure 17 shows the average tensor completion performance over 10 independent runs in terms of RMSE score. In the \(500 \times 500 \times 500\) case, all ten problems have been solved with an RMSE score around 0.20, with computation times ranging between 400 and 500 s. In the \(1,000 \times 1,000 \times 1,000\) case, all ten problems have also been solved with an RMSE score around 0.20, with computation times ranging from 8,000 to 12,000 s. The larger case is thus approximately 20 times slower than the \(500 \times 500 \times 500\) case while having 16 times more non-missing entries (20 million versus 1.25 million), which is consistent with roughly linear growth in \(N\).
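To illustrate why the cost can grow linearly in the number of known entries, the following is a minimal sketch of a nonnegative CP fit with the EUC loss that touches only the observed coordinates. It is not the authors' implementation: the function names (`cp_values`, `masked_cp_euc`, `demo`), the small problem size, the reading of "20 % noise" as 20 % of the standard deviation of the clean values, and the clipping of noisy values at zero are all assumptions made for illustration. The updates are the standard masked multiplicative updates for squared Euclidean error.

```python
import numpy as np

def cp_values(factors, idx):
    """Evaluate a 3-way CP model only at the observed coordinates idx (N x 3)."""
    A, B, C = factors
    return np.sum(A[idx[:, 0]] * B[idx[:, 1]] * C[idx[:, 2]], axis=1)

def masked_cp_euc(idx, vals, shape, rank, n_iter=200, seed=0, eps=1e-12):
    """Nonnegative CP fit on observed entries via multiplicative updates (EUC loss).

    Every step works on arrays with N rows, so one sweep costs O(N * rank).
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) + 0.1 for dim in shape]
    for _ in range(n_iter):
        for mode in range(3):
            o1, o2 = [m for m in range(3) if m != mode]
            # Product of the other two factors at each observed entry: N x rank
            kr = factors[o1][idx[:, o1]] * factors[o2][idx[:, o2]]
            xhat = cp_values(factors, idx)
            num = np.zeros_like(factors[mode])
            den = np.zeros_like(factors[mode])
            # Scatter-add the per-entry contributions into the rows of this mode
            np.add.at(num, idx[:, mode], vals[:, None] * kr)
            np.add.at(den, idx[:, mode], xhat[:, None] * kr)
            factors[mode] *= num / (den + eps)
    return factors

def demo(shape=(30, 30, 30), rank=3, frac_known=0.2, noise=0.2, seed=0):
    """Small synthetic completion problem (sizes are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    true = [rng.random((d, rank)) for d in shape]
    n_total = int(np.prod(shape))
    n_known = round(frac_known * n_total)
    flat = rng.choice(n_total, size=n_known, replace=False)
    idx = np.stack(np.unravel_index(flat, shape), axis=1)
    clean = cp_values(true, idx)
    # Assumption: noise level is relative to the spread of the clean values;
    # clipping keeps the data nonnegative, as the multiplicative updates require.
    noisy = np.maximum(clean + noise * clean.std() * rng.standard_normal(n_known), 0.0)
    fitted = masked_cp_euc(idx, noisy, shape, rank, seed=seed + 1)
    rmse = np.sqrt(np.mean((cp_values(fitted, idx) - noisy) ** 2))
    baseline = noisy.std()  # RMSE of predicting the mean of the known values
    return fitted, rmse, baseline
```

Calling `demo()` returns the fitted factors, the RMSE on the known entries, and the RMSE of a constant-mean predictor for comparison. Because no dense tensor is ever formed, memory and per-sweep time depend on the number of known entries rather than on the full array size.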
About this article
Cite this article
Ermiş, B., Acar, E. & Cemgil, A.T. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Min Knowl Disc 29, 203–236 (2015). https://doi.org/10.1007/s10618-013-0341-y
DOI: https://doi.org/10.1007/s10618-013-0341-y