
FacetCube: a general framework for non-negative tensor factorization


Abstract

Non-negative tensor factorization (NTF) has been successfully used to extract significant characteristics from polyadic data, such as data in social networks. Because these polyadic data have multiple dimensions (e.g., the author, content, and timestamp of a blog post), NTF fits naturally and extracts data characteristics jointly from the different data dimensions. In traditional NTF, all information comes from the observed data, and therefore the end users have no control over the outcomes. However, in many applications the end users have certain prior knowledge, such as demographic information about individuals in a social network or a pre-constructed ontology on the contents, and therefore prefer the data characteristics extracted by NTF to be consistent with such prior knowledge. To allow users' prior knowledge to be naturally incorporated into NTF, in this paper we present a general framework, FacetCube, that extends the standard NTF. The new framework allows the end users to control the factorization outputs at three different levels for each of the data dimensions. The proposed framework is intuitively appealing in that it has a close connection to probabilistic generative models. In addition to introducing the framework, we provide an iterative algorithm for computing its optimal solution. We also develop an efficient implementation of the algorithm that consists of several techniques that make our framework scalable to large data sets. Extensive experimental studies on a paper citation data set and a blog data set demonstrate that our new framework is able to effectively incorporate users' prior knowledge, improves performance over traditional NTF on the task of personalized recommendation, and is scalable to large data sets from real-life applications.


Notes

  1. FacetCube stands for “factorize data using NTF with co-ordinates being unconstrained, basis-constrained, or fixed”.

  2. For the discussion in this part, we restrict attention to the case of unconstrained dimensions, because the computations for \(X_{B}X, Y_{B}Y\), and \(Z_{B}Z\) do not dominate the time complexity.

  3. http://citeseer.ist.psu.edu/.

  4. http://opennlp.sourceforge.net/.

References

  1. Aussenac-Gilles N, Mothe J (2004) Ontologies as background knowledge to explore document collections. In: RIAO, pp 129–142

  2. Banerjee A, Basu S, Merugu S (2007) Multi-way clustering on relation graphs. In: SIAM international conference on data mining

  3. Carroll JD, Pruzansky S, Kruskal JB (1980) CANDELINC: a general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika 45

  4. Chew PA, Bader BW, Kolda TG, Abdelali A (2007) Cross-language information retrieval using PARAFAC2. In: Proceedings of the 13th SIGKDD conference

  5. Chi Y, Zhu S (2010) FacetCube: a framework of incorporating prior knowledge into non-negative tensor factorization. In: Proceedings of the 19th CIKM conference

  6. Chi Y, Zhu S, Song X, Tatemura J, Tseng BL (2007) Structural and temporal analysis of the blogosphere through community factorization. In: Proceedings of the 13th SIGKDD conference

  7. Chi Y, Zhu S, Gong Y, Zhang Y (2008) Probabilistic polyadic factorization and its application to personalized recommendation. In: Proceedings of the 17th CIKM conference

  8. Chi Y, Zhu S, Hino K, Gong Y, Zhang Y (2009) iOLAP: a framework for analyzing the internet, social networks, and other networked data. IEEE Trans Multimedia 11(3):372–382


  9. De Lathauwer L, De Moor B, Vandewalle J (2000) A multilinear singular value decomposition. SIAM J Matrix Anal Appl 21(4):1253–1278. doi:10.1137/S0895479896305696


  10. Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the 9th ACM SIGKDD conference

  11. Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: SIAM SDM

  12. Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the l1-ball for learning in high dimensions. In: Proceedings of the 25th international conference on machine learning

  13. Fagin R, Lotem A, Naor M (2001) Optimal aggregation algorithms for middleware. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems

  14. FitzGerald D, Cranitch M, Coyle E (2005) Non-negative tensor factorisation for sound source separation. In: Proceedings of the Irish Signals and Systems Conference

  15. Gaussier E, Goutte C (2005) Relation between PLSA and NMF and implications. In: Proceedings of the 28th SIGIR conference

  16. Harshman RA (1970) Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. UCLA Working Papers in Phonetics 16

  17. Hazan T, Polak S, Shashua A (2005) Sparse image coding using a 3D non-negative tensor factorization. In: Proceedings of the 10th ICCV conference

  18. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1–2):177–196. doi:10.1023/A:1007617005950


  19. Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd SIGIR conference

  20. Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500


  21. Lafferty JD, Zhai C (2001) Document language models, query models, and risk minimization for information retrieval. In: SIGIR, pp 111–119

  22. Lee DD, Seung HS (2000) Algorithms for non-negative matrix factorization. In: NIPS

  23. Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2008) FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th WWW conference

  24. Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2009a) Analyzing communities and their evolutions in dynamic social networks. ACM Trans Knowl Discov Data 3(2):8:1–8:31. doi:10.1145/1514888.1514891

  25. Lin Y-R, Sun J, Castro P, Konuru R, Sundaram H, Kelliher A (2009b) MetaFac: community discovery via relational hypergraph factorization. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining

  26. Long B, Zhang Z, Yu PS (2007) A probabilistic framework for relational clustering. In: Proceedings of the 13th SIGKDD conference

  27. Mørup M, Hansen LK, Arnfred SM (2008) Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 20(8):2112–2131


  28. Peng W (2009) Equivalence between nonnegative tensor factorization and tensorial probabilistic latent semantic analysis. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval

  29. Porteous I, Bart E, Welling M (2008) Multi-HDP: a nonparametric Bayesian model for tensor factorization. In: Proceedings of the 23rd national conference on artificial intelligence

  30. Shashua A, Hazan T (2005) Non-negative tensor factorization with applications to statistics and computer vision. In: Proceedings of the 22nd ICML conference

  31. Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) GraphScope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th SIGKDD conference

  32. Sun J-T, Zeng H-J, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the 14th WWW conference

  33. Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31

  34. Wang F, Li P, König A, Wan M (2011) Improving clustering by learning a bi-stochastic data similarity matrix. Knowl Inf Syst, pp 1–32

  35. Xiong L, Chen X, Huang T-K, Schneider J, Carbonell JG (2010) Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In: SDM

  36. Zaragoza H, Hiemstra D, Tipping ME (2003) Bayesian extension to the language model for ad hoc information retrieval. In: SIGIR, pp 4–9

  37. Zhang Z-Y, Li T, Ding C (2011) Non-negative tri-factor tensor decomposition with applications. Knowl Inf Syst, pp 1–23

  38. Zhou D, Zhu S, Yu K, Song X, Tseng BL, Zha H, Giles CL (2008) Learning multiple graphs for document recommendations. In: WWW, pp 141–150

  39. Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification using matrix factorization. In: SIGIR


Acknowledgments

The authors would like to thank Professor C. Lee Giles for providing the CiteSeer data set and thank Koji Hino and Junichi Tatemura for helping prepare the blog data set.

Author information

Correspondence to Yun Chi.

Appendix

1.1 Proof of Theorem 1

Proof

Assume that the values obtained from the previous iteration are \(\tilde{X}, \tilde{Y}, \tilde{Z}\), and \(\tilde{\mathcal{C}}\), respectively. We prove the update rule for \(X\); the rules for \(Y, Z\), and \(\mathcal{C}\) can be proved similarly. For the update rule of \(X\), we can consider \(Y, Z\), and \(\mathcal{C}\) as fixed (i.e., fixed at their values \(\tilde{Y}, \tilde{Z}\), and \(\tilde{\mathcal{C}}\) from the previous iteration). To avoid notational clutter, we define \(\tilde{\tilde{X}} \doteq X_{B}\tilde{X}\), \(\tilde{\tilde{Y}} \doteq Y_{B}\tilde{Y}\), and \(\tilde{\tilde{Z}} \doteq Z_{B}\tilde{Z}\), and we rewrite the objective function as

$$\begin{aligned} \min_{X} D_{X}(X) = \min_{X} KL(\mathcal{A} \,\|\, [\tilde{\mathcal{C}}, X_{B}X, \tilde{\tilde{Y}}, \tilde{\tilde{Z}}]) \end{aligned}$$

First define

$$\begin{aligned} \gamma_{ijklmnl^{\prime}} = \tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}\tilde{X}_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} \end{aligned}$$

and

$$\begin{aligned} \theta_{ijklmnl^{\prime}} = \frac{\tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}\tilde{X}_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn}}{[\tilde{\mathcal{C}}, \tilde{\tilde{X}}, \tilde{\tilde{Y}}, \tilde{\tilde{Z}}]_{ijk}} = \frac{\gamma_{ijklmnl^{\prime}}}{[\tilde{\mathcal{C}}, \tilde{\tilde{X}}, \tilde{\tilde{Y}}, \tilde{\tilde{Z}}]_{ijk}} \end{aligned}$$

where obviously we have \(\sum_{lmnl^{\prime}} \theta_{ijklmnl^{\prime}} = 1\) for each fixed \((i,j,k)\).

Then we have

$$\begin{aligned} D_{X}(X) =&\sum_{ijk}\left[\sum_{lmnl^{\prime}} \tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} - \mathcal{A}_{ijk}\ln \left( \sum_{lmnl^{\prime}} \tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn}\right) \right] + c_{1}\\ \le&\sum_{ijklmnl^{\prime}} \left[ \tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} - \mathcal{A}_{ijk}\, \theta_{ijklmnl^{\prime}} \ln \frac{\tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn}}{\theta_{ijklmnl^{\prime}}} \right] + c_{1}\\ =&\sum_{ijklmnl^{\prime}} \left[ \tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} - \gamma_{ijklmnl^{\prime}} \tilde{\mathcal{B}}_{ijk} \ln \left(\tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} \right) \right] + c_{2}\\ =&- \sum_{ijklmnl^{\prime}} \gamma_{ijklmnl^{\prime}} \tilde{\mathcal{B}}_{ijk} \ln \left(\tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} \right) + c_{3}\\ \doteq&\; Q_{X}(X;\tilde{X}), \end{aligned}$$

where \(c_{1}, c_{2}\), and \(c_{3}\) are constants that do not depend on \(X\). The inequality is Jensen's inequality applied to \(\ln(\cdot)\) with the weights \(\theta_{ijklmnl^{\prime}}\), and the second equality uses \(\mathcal{A}_{ijk}\theta_{ijklmnl^{\prime}} = \gamma_{ijklmnl^{\prime}}\tilde{\mathcal{B}}_{ijk}\), with \(\tilde{\mathcal{B}}_{ijk} = \mathcal{A}_{ijk}/[\tilde{\mathcal{C}}, \tilde{\tilde{X}}, \tilde{\tilde{Y}}, \tilde{\tilde{Z}}]_{ijk}\). Note that in the last step of the above derivation, we used the fact that \(\sum_{ijklmnl^{\prime}}\tilde{\mathcal{C}}_{lmn} (X_{B})_{il^{\prime}}X_{l^{\prime}l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} = \sum_{lmn}\tilde{\mathcal{C}}_{lmn}\), a constant, because the columns of \(X_{B}X\), \(\tilde{\tilde{Y}}\), and \(\tilde{\tilde{Z}}\) all sum to one.

It can be easily shown that \(Q_{X}(X;\tilde{X})\) is an auxiliary function of \(D_{X}(X)\) in the sense that

$$\begin{aligned} D_{X}(X)&\le Q_{X}(X;\tilde{X}), \quad\text{and} \end{aligned}$$
(16)
$$\begin{aligned} D_{X}(X)&= Q_{X}(X;X). \end{aligned}$$
(17)

With such an auxiliary function, we can use the following EM-style argument to show that \({X}^{*}=\mathop {\arg \,\min }\limits _{X} Q_{X}(X;\tilde{X})\) is guaranteed to reduce \(D_{X}(X)\), namely \(D_{X}({X}^{*}) \le D_{X}(\tilde{X})\):

$$\begin{aligned} D_{X}(\tilde{X})&= Q_{X}(\tilde{X};\tilde{X}) \quad \text{(by Eq. (17))}\\&\ge Q_{X}({X}^{*};\tilde{X})\\&\ge D_{X}({X}^{*}) \quad \text{(by Eq. (16))} \end{aligned}$$

So the problem is reduced to minimizing \(Q_{X}(X;\tilde{X})\) with respect to \(X\), under the constraint that all the columns of \(X\) sum to one. We define the Lagrangian

$$\begin{aligned} L(X,\vec{\lambda}) = Q_{X}(X;\tilde{X}) + \vec{\lambda}^{T}({X}^{T}\vec{1}_{L^{\prime}}-\vec{1}_{L}), \end{aligned}$$

and by taking its derivatives and setting them to zero, we have

$$\begin{aligned} \frac{\partial L}{\partial X_{l^{\prime}l}}&= -\frac{\tilde{X}_{l^{\prime}l}}{X_{l^{\prime}l}}\sum_{ijkmn}\tilde{\mathcal{B}}_{ijk}\tilde{\mathcal{C}}_{lmn}(X_{B})_{il^{\prime}}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn}+\lambda_{l} = 0\\ \frac{\partial L}{\partial \lambda_{l}}&= \sum_{l^{\prime}} X_{l^{\prime}l}-1=0 \end{aligned}$$

Solving these two equations gives \(X_{l^{\prime}l} = \tilde{X}_{l^{\prime}l}S_{l^{\prime}l} \big/ \sum_{l^{\prime\prime}}\tilde{X}_{l^{\prime\prime}l}S_{l^{\prime\prime}l}\), where \(S_{l^{\prime}l} = \sum_{ijkmn}\tilde{\mathcal{B}}_{ijk}\tilde{\mathcal{C}}_{lmn}(X_{B})_{il^{\prime}}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn}\); this is the update rule for \(X\) in Theorem 1. \(\square \)
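
As a concrete illustration, here is a minimal NumPy sketch of one such multiplicative update of \(X\). This is our own illustration rather than the paper's reference implementation; it assumes \(\tilde{\mathcal{B}}\) is the elementwise ratio \(\mathcal{A}/[\tilde{\mathcal{C}}, X_{B}\tilde{X}, Y_{B}\tilde{Y}, Z_{B}\tilde{Z}]\) as in the derivation above, and all function and variable names are ours.

```python
import numpy as np

def update_X(A, C, X_B, X, YY, ZZ):
    """One multiplicative update of X (a sketch, not the paper's code).

    Shapes: A is I x J x K, C is L x M x N, X_B is I x L', X is L' x L,
    YY = Y_B Y is J x M, ZZ = Z_B Z is K x N; all entries non-negative.
    """
    XX = X_B @ X                                        # I x L
    R = np.einsum('lmn,il,jm,kn->ijk', C, XX, YY, ZZ)   # reconstruction of A
    B = A / R                                           # tilde-B, elementwise
    P = np.einsum('ijk,jm,kn->imn', B, YY, ZZ)          # fold in the Y and Z modes
    S = np.einsum('imn,iq,lmn->ql', P, X_B, C)          # S_{l'l}, shape L' x L
    X_new = X * S                                       # tilde-X_{l'l} * S_{l'l}
    return X_new / X_new.sum(axis=0, keepdims=True)     # columns sum to one
```

In practice the contraction order matters for scalability (cf. footnote 2); the einsum contractions above are only meant to mirror the index pattern of the update rule.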

1.2 Proof of Corollary 1

The \((p-1)\)-dimensional simplex shrunk by \(\epsilon\), where \(p\epsilon < 1\), is defined as

$$\begin{aligned} \mathbb{S}^{p-1}_{\epsilon} = \left\{ x \in \mathbb{R}^{p}_{+} : x_{k} \ge \epsilon \ \text{and} \ \sum_{k=1}^{p} x_{k} = 1\right\}. \end{aligned}$$

For a given \(\zeta \in \mathbb{R}^{p}\) with \(\max\{\zeta_{k}\} > 0\), we define the projection

$$\begin{aligned} \mathcal{P}^{p}_{\epsilon}\zeta = \mathop{\arg\,\min}\limits_{\xi \in \mathbb{S}^{p-1}_{\epsilon}} -\sum_{k} \zeta_{k} \ln \xi_{k}. \end{aligned}$$
(18)

Lemma 2

The loss function is defined as

$$\begin{aligned} f(x) = -\sum _{i} a_{i} \ln \left(\sum _{k} w_{ik} x_{k}\right) - (\alpha -1) \sum _{k} \ln (x_{k}), \end{aligned}$$

where \(a_{i}\ge 0, \sum _{i} a_{i} > p(1-\alpha )\), and \(w_{ik}>0\). For \(x \in \mathbb S ^{p-1}_{\epsilon }\), where \(\epsilon < 1/p\), if

$$\begin{aligned} \begin{aligned} b_{i}&= \frac{a_{i} }{\sum _{k} w_{ik} x_{k}}, \\ \zeta _{k}&= x_{k} \sum _{i} {b}_{i} {w}_{ik} + \alpha -1, \\ \xi&= \mathcal{P}^{p}_{\epsilon } \zeta . \end{aligned} \end{aligned}$$

then \(\xi \in \mathbb{S}^{p-1}_{\epsilon}\) and \(f(\xi) \le f(x)\).

Proof

We introduce an auxiliary function,

$$\begin{aligned} \begin{aligned} g(z; x)&= -\sum _{ik}{b_{i}w_{ik}x_{k}} \ln (z_{k}) +\sum _{ik}{b_{i}w_{ik}x_{k}} \ln \left(\frac{x_{k}}{\sum _{k} w_{ik} x_{k}}\right) - (\alpha -1) \sum _{k} \ln (z_{k}) \\&= -\sum _{k} \zeta _{k} \ln (z_{k}) +\sum _{ik}{b_{i}w_{ik}x_{k}} \ln \left(\frac{x_{k}}{\sum _{k} w_{ik} x_{k}}\right) \end{aligned} \end{aligned}$$

We have \(g(x;x)=f(x)\) and \(g(z;x)\ge f(z)\) for any \(z\), by the convexity of \(-\ln(x)\) applied to the first term.

We have \(\max\{\zeta_{k}\} > 0\), because \(\sum_{k} \zeta_{k} = \sum_{i} a_{i} + p\alpha -p >0\). We also have \(\lambda > 0\), where \(\lambda\) and \(\gamma_{k}\) are the KKT multipliers introduced below: \(\lambda = \zeta_{k}/\xi_{k} + \gamma_{k} \ge \zeta_{k}/\xi_{k}\) for every \(k\), and taking the \(k\) that attains \(\max\{\zeta_{k}\}\) gives \(\lambda \ge \max\{\zeta_{k}\} > 0\), since \(\xi_{k} \le 1\). By the definition in Eq. (18), \(\xi = \mathcal{P}^{p}_{\epsilon} \zeta \) minimizes \(g(z;x)\) over \(\mathbb{S}^{p-1}_{\epsilon}\). Therefore, \(f(\xi ) \le g(\xi ;x) \le g(x;x)=f(x)\). \(\square \)

Although we do not have an explicit formula for \(\mathcal{P}^{p}_{\epsilon}\), inspired by [12], we can compute \(\mathcal{P}^{p}_{\epsilon} \zeta \) efficiently.

The Lagrangian for Eq. (18) is

$$\begin{aligned} \mathcal{L}= - \sum _{k} \zeta _{k} \ln \xi _{k} + \lambda \left(\sum _{k} \xi _{k} -1\right) + \sum _{k} \gamma _{k} (\epsilon - \xi _{k}), \end{aligned}$$

where \(\gamma_{k} \ge 0\). From the KKT conditions, we have \( -\frac{\zeta_{k}}{\xi_{k}} +\lambda - \gamma_{k} =0\), \(\sum_{k} \xi_{k} =1\), and \(\gamma_{k} =0\) if \(\xi_{k} >\epsilon \).

We now show that \(\xi_{k} \ge \xi_{l}\) whenever \(\zeta_{k} \ge \zeta_{l}\). Suppose, to the contrary, that \(\xi_{l} > \xi_{k}\). If \(\xi_{l} > \xi_{k} > \epsilon\), then \(\zeta_{l} = \lambda \xi_{l} > \lambda \xi_{k} = \zeta_{k}\), a contradiction. If \(\xi_{l} > \xi_{k} = \epsilon\), then \(\zeta_{l} = \lambda \xi_{l} > \lambda \epsilon \ge (\lambda -\gamma_{k}) \xi_{k} = \zeta_{k}\), again a contradiction.

We look for a threshold \(\omega\) such that \(\xi_{k} = \epsilon\) iff \(\zeta_{k} \le \omega\); thus \(\xi_{k} = \zeta_{k}/\lambda > \epsilon\) if \(\zeta_{k} > \omega\). Let \(\mathbb{A}_{\omega} =\{k: \zeta_{k} \le \omega \}\); then

$$\begin{aligned} \frac{1}{\lambda }\sum _{k \not \in \mathbb A _{\omega }} \zeta _{k} + |\mathbb A _{\omega }| \epsilon = 1. \end{aligned}$$

So \(\lambda = \frac{\sum _{k \not \in \mathbb A _{\omega }} \zeta _{k}}{1-|\mathbb A _{\omega }|\epsilon }\). Because \(\xi_{k} = \epsilon\) and \(\lambda \ge \zeta_{k}/\xi_{k}\) for every \(k \in \mathbb{A}_{\omega}\), we have \(\lambda \ge \omega/\epsilon\), and thus

$$\begin{aligned} \epsilon \sum_{\zeta_{k} > \omega} \zeta_{k} \ge \omega \left(1-|\{k : \zeta_{k} \le \omega\}|\,\epsilon \right). \end{aligned}$$
(19)

Since \(\max \{\zeta _{k}\} > 0\) and \(\epsilon < 1/p\), we can find a valid \(\omega \) such that \(\omega \ne \max \{\zeta _{k}\}\). We select the largest \(\omega \) that satisfies Eq. (19).

With the analysis above, we can compute \(\mathcal{P}^{p}_{\epsilon}\zeta\) by Algorithm 7.1.

[Algorithm 7.1: computing the projection \(\mathcal{P}^{p}_{\epsilon}\zeta\)]
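
Here is a minimal Python sketch of how we read Algorithm 7.1 from the analysis above: sort \(\zeta\), select the largest threshold \(\omega\) satisfying Eq. (19), and clamp the coordinates at or below it to \(\epsilon\). The function name is ours, and this is a sketch under the stated assumptions rather than the paper's implementation.

```python
import numpy as np

def project_shrunk_simplex(zeta, eps):
    """Compute P^p_eps(zeta) of Eq. (18): minimize -sum_k zeta_k * ln(xi_k)
    over {xi : xi_k >= eps, sum_k xi_k = 1}.  Assumes max(zeta) > 0 and
    len(zeta) * eps < 1, as required in the text.
    """
    zeta = np.asarray(zeta, dtype=float)
    p = zeta.size
    assert zeta.max() > 0 and p * eps < 1
    zs = np.sort(zeta)[::-1]            # zeta in decreasing order
    csum = np.cumsum(zs)
    for j in range(1, p + 1):           # keep the j largest coordinates free
        lam = csum[j - 1] / (1.0 - (p - j) * eps)
        nxt = zs[j] if j < p else -np.inf
        if zs[j - 1] > eps * lam >= nxt:          # threshold omega = eps * lam
            return np.maximum(eps, zeta / lam)    # free: zeta_k / lam; rest: eps
    raise AssertionError("unreachable under the stated assumptions")
```

For example, with \(\zeta = (2, 1, -1)\) and \(\epsilon = 0.1\), the loop settles on two free coordinates with \(\lambda = 3/0.9 \approx 3.33\) and returns \(\xi = (0.6, 0.3, 0.1)\), which indeed lies in \(\mathbb{S}^{2}_{0.1}\).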

Proof of Corollary 1

Each of \(\mathcal{C}, X, Y\), and \(Z\) can be updated using Lemma 2 after an appropriate reshaping. Thus, we can sequentially minimize the loss function of Eq. (5). \(\square \)
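
To make the sequential minimization concrete, here is a hedged sketch of the alternating loop, taking \(\alpha = 1\) in Lemma 2's loss for simplicity; `subproblems` is a hypothetical generator standing in for the paper's reshaping step (Eq. (5) itself is stated in the main text and not reproduced here), and `project_shrunk_simplex` is the sketch given above.

```python
# A sketch of the sequential minimization of Corollary 1, not the paper's
# implementation.  `subproblems` is a hypothetical generator yielding, for one
# column x of C, X, Y, or Z at a time, the data (a, w) that cast its
# subproblem into the f(x) of Lemma 2 (with alpha = 1 assumed here).
def alternating_minimize(subproblems, eps, n_iters=50):
    for _ in range(n_iters):
        for x, a, w in subproblems():    # a: vector of a_i; w: matrix of w_ik
            b = a / (w @ x)              # b_i = a_i / sum_k w_ik x_k
            zeta = x * (w.T @ b)         # zeta_k = x_k sum_i b_i w_ik
            x[:] = project_shrunk_simplex(zeta, eps)  # Lemma 2: f never increases
```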


About this article

Cite this article

Chi, Y., Zhu, S. FacetCube: a general framework for non-negative tensor factorization. Knowl Inf Syst 37, 155–179 (2013). https://doi.org/10.1007/s10115-012-0566-x
