Link prediction in heterogeneous data via generalized coupled tensor factorization

Data Mining and Knowledge Discovery

Abstract

This study deals with missing link prediction, the problem of predicting the existence of missing connections between entities of interest. We approach the problem as filling in missing entries in a relational dataset represented by several matrices and multiway arrays, which we simply call tensors. Accordingly, we address link prediction by data fusion, formulated as the simultaneous factorization of several observation tensors in which latent factors are shared among the observations. Previous studies on joint factorization of such heterogeneous datasets have focused on a single loss function (mainly squared Euclidean distance or Kullback–Leibler divergence) and on specific tensor factorization models (CANDECOMP/PARAFAC and/or Tucker). In this paper, we instead study various alternative tensor models and loss functions, including those already studied in the literature, using the generalized coupled tensor factorization framework. Through extensive experiments on two real-world datasets, we demonstrate that (i) joint analysis of data from multiple sources via coupled factorization significantly improves link prediction performance, (ii) selecting a suitable loss function and tensor factorization model is crucial for accurate missing link prediction, and loss functions that have not previously been studied for link prediction may outperform the commonly used ones, (iii) joint factorization of datasets can handle difficult cases, such as the cold start problem that arises when a new entity enters the dataset, and (iv) our approach scales to large datasets.
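
As an illustration of the coupled formulation, the update rules in the appendix imply a setting with one three-way tensor \(X_1\) and two matrices \(X_2\), \(X_3\), where the factor \(A\) is shared by \(X_1\) and \(X_2\), and \(B\) by \(X_1\) and \(X_3\). The following is a sketch under that reading; the binary masks \(M_\nu\) marking observed entries and the multi-index \(u\) are notation assumed here for illustration, and only the two loss functions named above are written out.

$$\begin{aligned} \hat{X}_1^{i,j,k}&= \sum \limits _{r} A^{i,r} B^{j,r} C^{k,r}, \quad \hat{X}_2^{i,m} = \sum \limits _{r} A^{i,r} D^{m,r}, \quad \hat{X}_3^{j,n} = \sum \limits _{r} B^{j,r} E^{n,r}, \\&\min _{A,B,C,D,E \ge 0} \; \sum \limits _{\nu =1}^{3} \sum \limits _{u} M_\nu ^{u} \, d\left( X_\nu ^{u} \,\Vert \, \hat{X}_\nu ^{u}\right) , \\ d_{\mathrm {EUC}}(x \,\Vert \, \hat{x})&= \tfrac{1}{2}(x - \hat{x})^2, \qquad d_{\mathrm {KL}}(x \,\Vert \, \hat{x}) = x \log \frac{x}{\hat{x}} - x + \hat{x}. \end{aligned}$$

Because \(A\) and \(B\) appear in more than one term of the sum, entries that are missing in one dataset can still be informed by the observed entries of the others.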

Notes

  1. Some of the studies listed in Table 1 (related studies on coupled factorization of heterogeneous data) do not impose nonnegativity constraints on the factor matrices, while GCTF assumes that all factor matrices are nonnegative.

  2. http://www.cse.ust.hk/~vincentz/aaai10.uclaf.data.mat.

  3. http://www.public.esu.edu/~ylin56/kdd09sup.html.

Acknowledgments

This work is funded by the TUBITAK Grant Number 110E292, Bayesian matrix and tensor factorisations (BAYTEN) and Boğaziçi University Research Fund BAP5723. It is also funded in part by the Danish Council for Independent Research—Technology and Production Sciences and Sapere Aude Program under the Projects 11-116328 and 11-120947.

Author information

Corresponding author

Correspondence to Beyza Ermiş.

Additional information

Responsible editor: Jian Pei.

Appendix

1.1 Computation for common factors

Here, we show the computation for \(A\):

$$\begin{aligned} \varDelta _{A,1}(Q)&= \left[ \sum \limits _{j,k} Q^{i,j,k} \left( B^{j,r} C^{k,r}\right) \right] = Q_1(BC), \\ \varDelta _{A,2}(Q)&= \left[ \sum \limits _{m} Q^{i,m} \left( D^{m,r}\right) \right] = Q_2 D, \\&A \leftarrow A \circ \frac{Q_1(BC) + Q_2 D}{\hat{X}_1^{-p}(BC) + \hat{X}_2^{-p} D}, \end{aligned}$$

and \(B\):

$$\begin{aligned} \varDelta _{B,1}(Q)&= \left[ \sum \limits _{i,k} Q^{i,j,k} \left( A^{i,r} C^{k,r}\right) \right] = Q_1(AC), \\ \varDelta _{B,3}(Q)&= \left[ \sum \limits _{n} Q^{j,n}\left( E^{n,r}\right) \right] = Q_3 E, \\&B \leftarrow B \circ \frac{Q_1(AC) + Q_3 E}{\hat{X}_1^{-p}(AC) + \hat{X}_3^{-p} E}, \end{aligned}$$

given in Model 1, Sect. 4.1.
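
The contractions \(\varDelta _{A,1}\) and \(\varDelta _{A,2}\) above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes the multiplicative update form commonly used with β-divergence factorizations, in which the numerator applies \(\varDelta \) to \(M \circ \hat{X}^{-p} \circ X\) and the denominator to \(M \circ \hat{X}^{1-p}\). The binary masks `M1` and `M2`, the `eps` guard, the toy dimensions, and all function names are hypothetical.

```python
import numpy as np

def delta_A(Q1, Q2, B, C, D):
    """Delta_{A,1}(Q1) + Delta_{A,2}(Q2): sum out every index except (i, r)."""
    term1 = np.einsum("ijk,jr,kr->ir", Q1, B, C)  # sum over j and k
    term2 = Q2 @ D                                # sum over m
    return term1 + term2

def update_A(X1, X2, M1, M2, A, B, C, D, p=0.0, eps=1e-12):
    """One multiplicative update of the shared factor A (assumed update form)."""
    X1_hat = np.einsum("ir,jr,kr->ijk", A, B, C) + eps  # CP model of X1
    X2_hat = A @ D.T + eps                              # matrix model of X2
    num = delta_A(M1 * X1 * X1_hat ** (-p), M2 * X2 * X2_hat ** (-p), B, C, D)
    den = delta_A(M1 * X1_hat ** (1.0 - p), M2 * X2_hat ** (1.0 - p), B, C, D)
    return A * num / (den + eps)

def euclidean_loss(X1, X2, M1, M2, A, B, C, D):
    """Masked squared Euclidean loss summed over both observations."""
    X1_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    X2_hat = A @ D.T
    return 0.5 * (np.sum(M1 * (X1 - X1_hat) ** 2) + np.sum(M2 * (X2 - X2_hat) ** 2))

# Toy data: noiseless observations generated from known nonnegative factors.
rng = np.random.default_rng(0)
n_i, n_j, n_k, n_m, R = 6, 5, 4, 7, 3
A_true, B, C, D = (rng.random((n, R)) + 0.1 for n in (n_i, n_j, n_k, n_m))
X1 = np.einsum("ir,jr,kr->ijk", A_true, B, C)
X2 = A_true @ D.T
M1 = (rng.random(X1.shape) < 0.7).astype(float)  # 1 = observed, 0 = missing
M2 = (rng.random(X2.shape) < 0.7).astype(float)

A0 = rng.random((n_i, R)) + 0.1                  # random nonnegative start
loss_before = euclidean_loss(X1, X2, M1, M2, A0, B, C, D)
A1 = update_A(X1, X2, M1, M2, A0, B, C, D, p=0.0)
loss_after = euclidean_loss(X1, X2, M1, M2, A1, B, C, D)
```

Under the assumed convention, `p=0.0` corresponds to the Euclidean loss and `p=1.0` to the Kullback–Leibler divergence. Because the update only multiplies `A` by a nonnegative ratio, nonnegativity of the factor is preserved without an explicit projection.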

1.2 Computational complexity

We conducted experiments on a tensor completion problem to demonstrate that the time complexity of the modeling framework is \(O(N)\) for sparse datasets, where \(N\) is the number of known entries. We consider two settings: (i) a \(500 \times 500 \times 500\) three-way array with 99 % missing data (1.25 million known values), and (ii) a \(1,000 \times 1,000 \times 1,000\) three-way array with 98 % missing data (20 million known values). We generated the data using a CP tensor factorization model with \(R = 3\) components and then added 20 % random Gaussian noise. We then fitted a CP model using the EUC distance-based loss function and used the extracted CP factors to reconstruct the data. Figure 17 shows the average tensor completion performance over 10 independent runs in terms of RMSE. In the \(500 \times 500 \times 500\) case, all ten problems were solved with an RMSE of around 0.20, with computation times between 400 and 500 s. In the \(1,000 \times 1,000 \times 1,000\) case, all ten problems were also solved with an RMSE of around 0.20, with computation times between 8,000 and 12,000 s. The larger case is thus approximately 20 times slower than the smaller one, while it has 16 times more non-missing entries.
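
The arithmetic behind these figures, and the reason the cost can scale with the number of known entries, can be sketched as follows. This is an illustrative sketch only: the function names and the toy sizes are hypothetical, and the snippet evaluates a CP model at the observed coordinates rather than reproducing the experiment.

```python
import math
import numpy as np

def known_entries(shape, missing_percent):
    """Number of observed cells, computed with integer arithmetic."""
    return math.prod(shape) * (100 - missing_percent) // 100

def cp_values_at(idx, A, B, C):
    """CP model values at the observed coordinates only.

    Work is proportional to N * R for N observed entries; the dense
    tensor is never formed.
    """
    i, j, k = idx
    return np.einsum("nr,nr,nr->n", A[i], B[j], C[k])

def rmse_on_known(idx, values, A, B, C):
    """Root mean squared error restricted to the observed entries."""
    residual = values - cp_values_at(idx, A, B, C)
    return float(np.sqrt(np.mean(residual ** 2)))

# Entry counts for the two settings described in the text.
n_small = known_entries((500, 500, 500), 99)
n_large = known_entries((1000, 1000, 1000), 98)
entry_ratio = n_large / n_small

# Bounds on the runtime ratio implied by the reported time ranges.
time_ratio_low = 8000 / 500
time_ratio_high = 12000 / 400

# Toy check that the sparse evaluation agrees with the dense reconstruction.
rng = np.random.default_rng(0)
A, B, C = rng.random((8, 3)), rng.random((7, 3)), rng.random((6, 3))
dense = np.einsum("ir,jr,kr->ijk", A, B, C)
idx = (rng.integers(0, 8, 50), rng.integers(0, 7, 50), rng.integers(0, 6, 50))
values = dense[idx]
toy_rmse = rmse_on_known(idx, values, A, B, C)
```

The reported time ranges bound the runtime ratio between 16 and 30, which brackets the 16-fold increase in known entries from below and is consistent with roughly linear growth in \(N\).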

Fig. 17 Results of our algorithm for large-scale problems. The means are shown as solid lines

About this article

Cite this article

Ermiş, B., Acar, E. & Cemgil, A.T. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Min Knowl Disc 29, 203–236 (2015). https://doi.org/10.1007/s10618-013-0341-y

  • DOI: https://doi.org/10.1007/s10618-013-0341-y
