Abstract
Non-negative tensor factorization (NTF) has been successfully used to extract significant characteristics from polyadic data, such as data in social networks. Because these polyadic data have multiple dimensions (e.g., the author, content, and timestamp of a blog post), NTF fits naturally and extracts data characteristics jointly from the different data dimensions. In traditional NTF, all information comes from the observed data, and therefore the end users have no control over the outcomes. In many applications, however, the end users have certain prior knowledge, such as demographic information about individuals in a social network or a pre-constructed ontology on the contents, and therefore prefer that the data characteristics extracted by NTF be consistent with such prior knowledge. To allow users’ prior knowledge to be naturally incorporated into NTF, in this paper we present a general framework, called FacetCube, that extends the standard NTF. The new framework allows the end users to control the factorization outputs at three different levels for each of the data dimensions. The proposed framework is intuitively appealing in that it has a close connection to probabilistic generative models. In addition to introducing the framework, we provide an iterative algorithm for computing its optimal solution. We also develop an efficient implementation of the algorithm that uses several techniques to make our framework scalable to large data sets. Extensive experimental studies on a paper citation data set and a blog data set demonstrate that our new framework effectively incorporates users’ prior knowledge, improves performance over traditional NTF on the task of personalized recommendation, and scales to large data sets from real-life applications.
Notes
FacetCube stands for “factorize data using NTF with co-ordinates being unconstrained, basis-constrained, or fixed”.
For the discussion in this part, we restrict attention to the case of unconstrained dimensions, because the computations for \(X_{B}X, Y_{B}Y\), and \(Z_{B}Z\) do not dominate the time complexity.
References
Aussenac-Gilles N, Mothe J (2004) Ontologies as background knowledge to explore document collections. In: RIAO, pp 129–142
Banerjee A, Basu S, Merugu S (2007) Multi-way clustering on relation graphs. In: SIAM international conference on data mining
Carroll JD, Pruzansky S, Kruskal JB (1980) CANDELINC: a general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika 45
Chew PA, Bader BW, Kolda TG, Abdelali A (2007) Cross-language information retrieval using PARAFAC2. In: Proceedings of the 13th SIGKDD conference
Chi Y, Zhu S (2010) FacetCube: a framework of incorporating prior knowledge into non-negative tensor factorization. In: Proceedings of the 19th CIKM conference
Chi Y, Zhu S, Song X, Tatemura J, Tseng BL (2007) Structural and temporal analysis of the blogosphere through community factorization. In: Proceedings of the 13th SIGKDD conference
Chi Y, Zhu S, Gong Y, Zhang Y (2008) Probabilistic polyadic factorization and its application to personalized recommendation. In: Proceedings of the 17th CIKM conference
Chi Y, Zhu S, Hino K, Gong Y, Zhang Y (2009) iOLAP: a framework for analyzing the internet, social networks, and other networked data. IEEE Trans Multimedia 11(3):372–382
De Lathauwer L, De Moor B, Vandewalle J (2000) A multilinear singular value decomposition. SIAM J Matrix Anal Appl 21(4):1253–1278. doi:10.1137/S0895479896305696
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the 9th ACM SIGKDD conference
Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: SIAM SDM
Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the l1-ball for learning in high dimensions. In: Proceedings of the 25th international conference on machine learning
Fagin R, Lotem A, Naor M (2001) Optimal aggregation algorithms for middleware. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems
FitzGerald D, Cranitch M, Coyle E (2005) Non-negative tensor factorisation for sound source separation. In: Proceedings of the Irish signals and systems conference
Gaussier E, Goutte C (2005) Relation between PLSA and NMF and implications. In: Proceedings of the 28th SIGIR conference
Harshman RA (1970) Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. UCLA working papers in phonetics, 16
Hazan T, Polak S, Shashua A (2005) Sparse image coding using a 3d non-negative tensor factorization. In: Proceedings of the 10th ICCV conference
Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1–2):177–196. doi:10.1023/A:1007617005950
Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd SIGIR conference
Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500
Lafferty JD, Zhai C (2001) Document language models, query models, and risk minimization for information retrieval. In: SIGIR, pp 111–119
Lee DD, Seung HS (2000) Algorithms for non-negative matrix factorization. In: NIPS
Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2008) FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th WWW conference
Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2009a) Analyzing communities and their evolutions in dynamic social networks. ACM Trans Knowl Discov Data 3(2):8:1–8:31. doi:10.1145/1514888.1514891
Lin Y-R, Sun J, Castro P, Konuru R, Sundaram H, Kelliher A (2009b) MetaFac: community discovery via relational hypergraph factorization. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining
Long B, Zhang Z, Yu PS (2007) A probabilistic framework for relational clustering. In: Proceedings of the 13th SIGKDD conference
Mørup M, Hansen LK, Arnfred SM (2008) Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 20(8):2112–2131
Peng W (2009) Equivalence between nonnegative tensor factorization and tensorial probabilistic latent semantic analysis. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval
Porteous I, Bart E, Welling M (2008) Multi-HDP: a nonparametric Bayesian model for tensor factorization. In: Proceedings of the 23rd national conference on artificial intelligence
Shashua A, Hazan T (2005) Non-negative tensor factorization with applications to statistics and computer vision. In: Proceedings of the 22nd ICML conference
Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) GraphScope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th SIGKDD conference
Sun J-T, Zeng H-J, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the 14th WWW conference
Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31
Wang F, Li P, König A, Wan M (2011) Improving clustering by learning a bi-stochastic data similarity matrix. Knowl Inf Syst, pp 1–32
Xiong L, Chen X, Huang T-K, Schneider J, Carbonell JG (2010) Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In: SDM
Zaragoza H, Hiemstra D, Tipping ME (2003) Bayesian extension to the language model for ad hoc information retrieval. In: SIGIR, pp 4–9
Zhang Z-Y, Li T, Ding C (2011) Non-negative tri-factor tensor decomposition with applications. Knowl Inf Syst, pp 1–23
Zhou D, Zhu S, Yu K, Song X, Tseng BL, Zha H, Giles CL (2008) Learning multiple graphs for document recommendations. In: WWW, pp 141–150
Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification using matrix factorization. In: SIGIR
Acknowledgments
The authors would like to thank Professor C. Lee Giles for providing the CiteSeer data set and thank Koji Hino and Junichi Tatemura for helping prepare the blog data set.
Appendix
1.1 Proof for Theorem 1
Proof
Assume that the values obtained from the previous iteration are \(\tilde{X}, \tilde{Y}, \tilde{Z}\), and \(\tilde{\mathcal{C }}\), respectively. We prove the update rule for \(X\); the rules for \(Y, Z\), and \(\mathcal C \) can be proved similarly. For the update rule of \(X\), we can consider \(Y, Z\), and \(\mathcal C \) as fixed (i.e., fixed at their values \(\tilde{Y}, \tilde{Z}\), and \(\tilde{\mathcal{C }}\) from the previous iteration). To avoid notational clutter, we define \(\tilde{\tilde{X}}\doteq X_{B}\tilde{X}, \tilde{\tilde{Y}}\doteq Y_{B}\tilde{Y}, \tilde{\tilde{Z}}\doteq Z_{B}\tilde{Z}\), and we rewrite the objective function as
First define
and
where obviously we have \(\sum _{ijklmnl^{\prime }} \theta _{ijklmnl^{\prime }} = 1\).
Then we have
where \(c_{1}, c_{2}\), and \(c_{3}\) are constants irrelevant to \(X\). Note that in the last step of the above derivation, we used the fact that \(\sum _{ijklmnl^{\prime }}\tilde{\mathcal{C }}_{lmn} (X_{B})_{il^{\prime }}X_{l^{\prime }l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} = \sum _{lmn}\tilde{\mathcal{C }}_{lmn}\) because the columns of \(X\), as well as those of \(X_{B}, \tilde{\tilde{Y}}\), and \(\tilde{\tilde{Z}}\), all sum to 1.
It can be easily shown that \(Q_{X}(X;\tilde{X})\) is an auxiliary function of \(D_{X}(X)\) in the sense that
With such an auxiliary function, we can use the following EM-style argument to show that \({X}^{*}=\mathop {\arg \,\min }\limits _{X} Q_{X}(X;\tilde{X})\) actually reduces \(D_{X}(X)\), namely that \(D_{X}({X}^{*}) \le D_{X}(\tilde{X})\) is guaranteed:
So the problem is reduced to minimizing \(Q_{X}(X;\tilde{X})\) with respect to \(X\), under the constraint that all the columns of \(X\) sum to one. We define the Lagrangian
and by taking its derivative and setting the result to zero, we have
which gives the update rule for \(X\) in Theorem 1. \(\square \)
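To make the EM-style machinery above concrete, the following is a minimal illustrative sketch restricted to the fully unconstrained case; it is not the exact update rule of Theorem 1, which also involves the basis matrices \(X_{B}, Y_{B}, Z_{B}\), and the function name and interface are our own. The E-step posterior over the latent indices plays the role of the auxiliary function, and the normalization step enforces the column-sum-to-one constraints:

```python
import random

def ntf_em(A, R1, R2, R3, iters=50, seed=0):
    """EM-style updates for a probabilistic Tucker-style NTF:
    A[i][j][k] ~ sum_{l,m,n} C[l][m][n] * X[i][l] * Y[j][m] * Z[k][n],
    where the columns of X, Y, Z and the core C each sum to one.
    (Illustrative sketch of the unconstrained case only.)"""
    rng = random.Random(seed)
    I, J, K = len(A), len(A[0]), len(A[0][0])

    def stoch_matrix(rows, cols):
        M = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]
        for c in range(cols):                      # normalize each column to 1
            s = sum(M[r][c] for r in range(rows))
            for r in range(rows):
                M[r][c] /= s
        return M

    X, Y, Z = stoch_matrix(I, R1), stoch_matrix(J, R2), stoch_matrix(K, R3)
    C = [[[1.0 / (R1 * R2 * R3)] * R3 for _ in range(R2)] for _ in range(R1)]

    for _ in range(iters):
        Xn = [[0.0] * R1 for _ in range(I)]
        Yn = [[0.0] * R2 for _ in range(J)]
        Zn = [[0.0] * R3 for _ in range(K)]
        Cn = [[[0.0] * R3 for _ in range(R2)] for _ in range(R1)]
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    if A[i][j][k] == 0:
                        continue
                    # E-step: unnormalized posterior over latent (l, m, n)
                    post, tot = {}, 0.0
                    for l in range(R1):
                        for m in range(R2):
                            for n in range(R3):
                                w = C[l][m][n] * X[i][l] * Y[j][m] * Z[k][n]
                                post[(l, m, n)] = w
                                tot += w
                    # M-step accumulation, weighted by the observed entry
                    for (l, m, n), w in post.items():
                        g = A[i][j][k] * w / tot
                        Xn[i][l] += g
                        Yn[j][m] += g
                        Zn[k][n] += g
                        Cn[l][m][n] += g
        # Normalize: columns of X, Y, Z sum to one; C sums to one
        for M, new, (rows, cols) in ((X, Xn, (I, R1)), (Y, Yn, (J, R2)),
                                     (Z, Zn, (K, R3))):
            for c in range(cols):
                s = sum(new[r][c] for r in range(rows)) or 1.0
                for r in range(rows):
                    M[r][c] = new[r][c] / s
        s = sum(Cn[l][m][n] for l in range(R1)
                for m in range(R2) for n in range(R3)) or 1.0
        for l in range(R1):
            for m in range(R2):
                for n in range(R3):
                    C[l][m][n] = Cn[l][m][n] / s
    return C, X, Y, Z
```

For a basis-constrained dimension, one would instead keep \(X_{B}\) fixed, replace \(X\) by \(X_{B}X\) in the reconstruction, and update only the free factor, as in the proof above.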
1.2 Proof for Corollary 1
A simplex space of dimension \(p-1\) shrunk by \(\epsilon \), where \(p \epsilon <1\), is defined as
For a given \(\zeta \in \mathbb R ^p\), and \(\max \{\zeta _k\}>0\), we define a projection
Lemma 2
The loss function is defined as
where \(a_{i}\ge 0, \sum _{i} a_{i} > p(1-\alpha )\), and \(w_{ik}>0\). For \(x \in \mathbb S ^{p-1}_{\epsilon }\), where \(\epsilon < 1/p\), if
then \(\xi \in \mathbb S ^{p-1}_{\epsilon }\) and \(f(\xi ) \le f(x)\) .
Proof
We introduce an auxiliary function,
We have \(g(x;x)=f(x)\) and \(g(z;x)\ge f(z)\) for any \(z\), because of convexity of \(-\ln (x)\) in the first term.
We have \(\max \{\zeta _{k}\} > 0\), because \(\sum _{k} \zeta _{k} = \sum _{i} a_{i} + p\alpha -p >0\). We have \(\lambda > 0\), because \(\lambda = \zeta _{k} /\xi _{k} +\gamma _{k} \ge \zeta _{k} /\xi _k \ge \max \{\zeta _{k}\} /\epsilon > 0\). Thus, \(\xi = \mathcal{P}^{p}_{\epsilon } \zeta \) minimizes \(g(z;x)\). Therefore, \(f(\xi ) \le g(\xi ;x) \le g(x;x)=f(x)\).
Although we do not have an explicit equation for \(\mathcal{P}^{p}_{\epsilon }\), inspired by [12], we can compute \(\mathcal{P}^{p}_{\epsilon } \zeta \) efficiently.
The Lagrangian for Eq. (18) is
where \(\gamma _{k} \ge 0\). With KKT condition, we have \( -\frac{1}{\xi _{k}} \zeta _{k} +\lambda - \gamma _{k} =0 , \sum _{k} \xi _{k} =1\), and \(\gamma _{k} =0\) if \(\xi _{k} >\epsilon \).
We prove that \(\xi _{k} \ge \xi _{l}\) if \(\zeta _{k} \ge \zeta _{l}\). Suppose that \(\xi _{l} > \xi _{k}\). If \(\xi _{l} > \xi _{k} > \epsilon \), it is contradicted by \(\zeta _{l} = \lambda \xi _{l} > \lambda \xi _{k} = \zeta _{k}\). If \(\xi _{l} > \xi _{k} = \epsilon \), it is contradicted by \(\zeta _{l} = \lambda \xi _{l} > \lambda \epsilon \ge (\lambda -\gamma _{k}) \xi _{k} = \zeta _{k}\).
We look for \(\omega \), such that \(\xi _{k} = \epsilon \) iff \(\zeta _{k} \le \omega \). Thus \(\xi _{k} = \zeta _{k} /\lambda > \epsilon \) if \(\zeta _{k} > \omega \). Let \(\mathbb A _{\omega } =\{k: \zeta _{k} \le \omega \}\), then
So \(\lambda = \frac{\sum _{k \not \in \mathbb A _{\omega }} \zeta _{k}}{1-|\mathbb A _{\omega }|\epsilon }\). Because \(\xi _{\omega } = \epsilon , \lambda \ge \omega / \xi _{\omega }\), thus
Since \(\max \{\zeta _{k}\} > 0\) and \(\epsilon < 1/p\), we can find a valid \(\omega \) such that \(\omega \ne \max \{\zeta _{k}\}\). We select the largest \(\omega \) that satisfies Eq. (19).
With the analysis above, we solve \(\mathcal{P}^{p}_{\epsilon }\) by Algorithm 7.1.
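Algorithm 7.1 is not reproduced in this excerpt; the following is a sketch of how \(\mathcal{P}^{p}_{\epsilon }\) can be computed from the analysis above, assuming the thresholded form \(\xi _{k} = \max (\epsilon , \zeta _{k}/\lambda )\) implied by the KKT conditions, with the clamped set \(\mathbb A _{\omega }\) found by scanning candidate sizes over the sorted entries (the function name is ours):

```python
def project_shrunk_simplex(zeta, eps):
    """Compute P^p_eps(zeta): minimize -sum_k zeta_k * ln(xi_k)
    subject to sum_k xi_k = 1 and xi_k >= eps, with p*eps < 1.
    KKT analysis gives xi_k = max(eps, zeta_k / lam) for a scalar lam."""
    p = len(zeta)
    assert p * eps < 1 and max(zeta) > 0
    sorted_z = sorted(zeta)  # ascending; smallest entries are clamped first
    # m = |A_omega| = number of clamped coordinates; prefer the largest
    # valid clamp set, matching the choice of the largest valid omega.
    for m in range(p - 1, -1, -1):
        lam = sum(sorted_z[m:]) / (1.0 - m * eps)
        ok_clamped = m == 0 or sorted_z[m - 1] <= lam * eps
        ok_free = sorted_z[m] > lam * eps
        if ok_clamped and ok_free:
            return [max(eps, z / lam) for z in zeta]
    raise ValueError("no valid threshold found")
```

For example, with \(\zeta = (0.9, 0.05, 0.05)\) and \(\epsilon = 0.1\), the two small coordinates are clamped to \(\epsilon\) and the remaining mass goes to the first coordinate.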
Proof of Corollary 1
Each of \(\mathcal C, X, Y\), and \(Z\) can be updated using Lemma 2 after a suitable reshaping. Thus, we can sequentially minimize the loss function of Eq. (5).\(\square \)
Chi, Y., Zhu, S. FacetCube: a general framework for non-negative tensor factorization. Knowl Inf Syst 37, 155–179 (2013). https://doi.org/10.1007/s10115-012-0566-x