Abstract
Non-negative tensor factorization (NTF) has been successfully used to extract significant characteristics from polyadic data, such as data in social networks. Because these polyadic data have multiple dimensions (e.g., the author, content, and timestamp of a blog post), NTF fits naturally and extracts data characteristics jointly from the different data dimensions. In traditional NTF, all information comes from the observed data, and therefore the end users have no control over the outcomes. In many applications, however, the end users have certain prior knowledge, such as demographic information about individuals in a social network or a pre-constructed ontology on the contents, and therefore prefer that the data characteristics extracted by NTF be consistent with such prior knowledge. To allow users’ prior knowledge to be naturally incorporated into NTF, in this paper we present a general framework, called FacetCube, that extends the standard NTF. The new framework allows the end users to control the factorization outputs at three different levels for each of the data dimensions. The proposed framework is intuitively appealing in that it has a close connection to probabilistic generative models. In addition to introducing the framework, we provide an iterative algorithm for computing its optimal solution. We also develop an efficient implementation of the algorithm that uses several techniques to make our framework scalable to large data sets. Extensive experimental studies on a paper citation data set and a blog data set demonstrate that our new framework effectively incorporates users’ prior knowledge, improves performance over traditional NTF on the task of personalized recommendation, and scales to large data sets from real-life applications.
Notes
FacetCube stands for “factorize data using NTF with co-ordinates being unconstrained, basis-constrained, or fixed”.
For the discussion in this part, we restrict attention to the case of unconstrained dimensions, because the computations for \(X_{B}X, Y_{B}Y\), and \(Z_{B}Z\) do not dominate the time complexity.
References
Aussenac-Gilles N, Mothe J (2004) Ontologies as background knowledge to explore document collections. In: RIAO, pp 129–142
Banerjee A, Basu S, Merugu S (2007) Multi-way clustering on relation graphs. In: SIAM international conference on data mining
Carroll JD, Pruzansky S, Kruskal JB (1980) CANDELINC: a general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika 45
Chew PA, Bader BW, Kolda TG, Abdelali A (2007) Cross-language information retrieval using PARAFAC2. In: Proceedings of the 13th SIGKDD conference
Chi Y, Zhu S (2010) FacetCube: a framework of incorporating prior knowledge into non-negative tensor factorization. In: Proceedings of the 19th CIKM conference
Chi Y, Zhu S, Song X, Tatemura J, Tseng BL (2007) Structural and temporal analysis of the blogosphere through community factorization. In: Proceedings of the 13th SIGKDD conference
Chi Y, Zhu S, Gong Y, Zhang Y (2008) Probabilistic polyadic factorization and its application to personalized recommendation. In: Proceedings of the 17th CIKM conference
Chi Y, Zhu S, Hino K, Gong Y, Zhang Y (2009) iOLAP: a framework for analyzing the internet, social networks, and other networked data. IEEE Trans Multimedia 11(3):372–382
De Lathauwer L, De Moor B, Vandewalle J (2000) A multilinear singular value decomposition. SIAM J Matrix Anal Appl 21(4):1253–1278. doi:10.1137/S0895479896305696
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the 9th ACM SIGKDD conference
Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: SIAM SDM
Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the l1-ball for learning in high dimensions. In: Proceedings of the 25th international conference on machine learning
Fagin R, Lotem A, Naor M (2001) Optimal aggregation algorithms for middleware. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems
FitzGerald D, Cranitch M, Coyle E (2005) Non-negative tensor factorisation for sound source separation. In: Proceedings of the Irish signals and systems conference
Gaussier E, Goutte C (2005) Relation between PLSA and NMF and implications. In: Proceedings of the 28th SIGIR conference
Harshman RA (1970) Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. UCLA working papers in phonetics, 16
Hazan T, Polak S, Shashua A (2005) Sparse image coding using a 3d non-negative tensor factorization. In: Proceedings of the 10th ICCV conference
Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1–2):177–196. doi:10.1023/A:1007617005950
Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd SIGIR conference
Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500
Lafferty JD, Zhai C (2001) Document language models, query models, and risk minimization for information retrieval. In: SIGIR, pp 111–119
Lee DD, Seung HS (2000) Algorithms for non-negative matrix factorization. In: NIPS
Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2008) FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th WWW conference
Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2009a) Analyzing communities and their evolutions in dynamic social networks. ACM Trans Knowl Discov Data 3(2):8:1–8:31. doi:10.1145/1514888.1514891
Lin Y-R, Sun J, Castro P, Konuru R, Sundaram H, Kelliher A (2009b) MetaFac: community discovery via relational hypergraph factorization. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining
Long B, Zhang Z, Yu PS (2007) A probabilistic framework for relational clustering. In: Proceedings of the 13th SIGKDD conference
Mørup M, Hansen LK, Arnfred SM (2008) Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 20(8):2112–2131
Peng W (2009) Equivalence between nonnegative tensor factorization and tensorial probabilistic latent semantic analysis. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval
Porteous I, Bart E, Welling M (2008) Multi-HDP: a nonparametric Bayesian model for tensor factorization. In: Proceedings of the 23rd national conference on artificial intelligence
Shashua A, Hazan T (2005) Non-negative tensor factorization with applications to statistics and computer vision. In: Proceedings of the 22nd ICML conference
Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) GraphScope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th SIGKDD conference
Sun J-T, Zeng H-J, Liu H, Lu Y, Chen Z (2005) CubeSVD: a novel approach to personalized web search. In: Proceedings of the 14th WWW conference
Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31
Wang F, Li P, König A, Wan M (2011) Improving clustering by learning a bi-stochastic data similarity matrix. Knowl Inf Syst, pp 1–32
Xiong L, Chen X, Huang T-K, Schneider J, Carbonell JG (2010) Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In: SDM
Zaragoza H, Hiemstra D, Tipping ME (2003) Bayesian extension to the language model for ad hoc information retrieval. In: SIGIR, pp 4–9
Zhang Z-Y, Li T, Ding C (2011) Non-negative tri-factor tensor decomposition with applications. Knowl Inf Syst, pp 1–23
Zhou D, Zhu S, Yu K, Song X, Tseng BL, Zha H, Giles CL (2008) Learning multiple graphs for document recommendations. In: WWW, pp 141–150
Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification using matrix factorization. In: SIGIR
Acknowledgments
The authors would like to thank Professor C. Lee Giles for providing the CiteSeer data set and thank Koji Hino and Junichi Tatemura for helping prepare the blog data set.
Appendix
1.1 Proof for Theorem 1
Proof
Assume that the values obtained from the previous iteration are \(\tilde{X}, \tilde{Y}, \tilde{Z}\), and \(\tilde{\mathcal{C }}\), respectively. We prove the update rule for \(X\); the rules for \(Y, Z\), and \(\mathcal C \) can be proved similarly. For the update rule of \(X\), we can consider \(Y, Z\), and \(\mathcal C \) as fixed (i.e., fixed at their values \(\tilde{Y}, \tilde{Z}\), and \(\tilde{\mathcal{C }}\) from the previous iteration). To avoid notational clutter, we define \(\tilde{\tilde{X}}\doteq X_{B}\tilde{X}, \tilde{\tilde{Y}}\doteq Y_{B}\tilde{Y}, \tilde{\tilde{Z}}\doteq Z_{B}\tilde{Z}\), and we rewrite the objective function as
First define
and
where obviously we have \(\sum _{ijklmnl^{\prime }} \theta _{ijklmnl^{\prime }} = 1\).
Then we have
where \(c_{1}, c_{2}\), and \(c_{3}\) are constants irrelevant to \(X\). Note that in the last step of the above derivation, we used the fact that \(\sum _{ijklmnl^{\prime }}\tilde{\mathcal{C }}_{lmn} (X_{B})_{il^{\prime }}X_{l^{\prime }l}\tilde{\tilde{Y}}_{jm}\tilde{\tilde{Z}}_{kn} = \sum _{lmn}\tilde{\mathcal{C }}_{lmn}\) because the columns of \(X\), as well as those of \(X_{B}, \tilde{\tilde{Y}}\), and \(\tilde{\tilde{Z}}\), all sum to 1.
It can be easily shown that \(Q_{X}(X;\tilde{X})\) is an auxiliary function of \(D_{X}(X)\) in the sense that
With such an auxiliary function, we can use the following EM-style argument to show that \({X}^{*}=\mathop {\arg \,\min }\limits _{X} Q_{X}(X;\tilde{X})\) actually reduces \(D_{X}(X)\), namely that \(D_{X}({X}^{*}) \le D_{X}(\tilde{X})\) is guaranteed:
So the problem is reduced to minimizing \(Q_{X}(X;\tilde{X})\) with respect to \(X\), under the constraint that all the columns of \(X\) sum to one. We define the Lagrangian
and by taking its derivative and setting the result to zero, we have
which gives the update rule for \(X\) in Theorem 1. \(\square \)
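To make the EM-style machinery above concrete, the following is a minimal illustrative sketch restricted to the fully unconstrained case; it is not the exact update rule of Theorem 1, which also involves the basis matrices \(X_{B}, Y_{B}, Z_{B}\), and the function name and interface are our own. The E-step posterior over the latent indices plays the role of the auxiliary function, and the normalization step enforces the column-sum-to-one constraints:

```python
import random

def ntf_em(A, R1, R2, R3, iters=50, seed=0):
    """EM-style updates for a probabilistic Tucker-style NTF:
    A[i][j][k] ~ sum_{l,m,n} C[l][m][n] * X[i][l] * Y[j][m] * Z[k][n],
    where the columns of X, Y, Z and the core C each sum to one.
    (Illustrative sketch of the unconstrained case only.)"""
    rng = random.Random(seed)
    I, J, K = len(A), len(A[0]), len(A[0][0])

    def stoch_matrix(rows, cols):
        M = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]
        for c in range(cols):                      # normalize each column to 1
            s = sum(M[r][c] for r in range(rows))
            for r in range(rows):
                M[r][c] /= s
        return M

    X, Y, Z = stoch_matrix(I, R1), stoch_matrix(J, R2), stoch_matrix(K, R3)
    C = [[[1.0 / (R1 * R2 * R3)] * R3 for _ in range(R2)] for _ in range(R1)]

    for _ in range(iters):
        Xn = [[0.0] * R1 for _ in range(I)]
        Yn = [[0.0] * R2 for _ in range(J)]
        Zn = [[0.0] * R3 for _ in range(K)]
        Cn = [[[0.0] * R3 for _ in range(R2)] for _ in range(R1)]
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    if A[i][j][k] == 0:
                        continue
                    # E-step: unnormalized posterior over latent (l, m, n)
                    post, tot = {}, 0.0
                    for l in range(R1):
                        for m in range(R2):
                            for n in range(R3):
                                w = C[l][m][n] * X[i][l] * Y[j][m] * Z[k][n]
                                post[(l, m, n)] = w
                                tot += w
                    # M-step accumulation, weighted by the observed entry
                    for (l, m, n), w in post.items():
                        g = A[i][j][k] * w / tot
                        Xn[i][l] += g
                        Yn[j][m] += g
                        Zn[k][n] += g
                        Cn[l][m][n] += g
        # Normalize: columns of X, Y, Z sum to one; C sums to one
        for M, new, (rows, cols) in ((X, Xn, (I, R1)), (Y, Yn, (J, R2)),
                                     (Z, Zn, (K, R3))):
            for c in range(cols):
                s = sum(new[r][c] for r in range(rows)) or 1.0
                for r in range(rows):
                    M[r][c] = new[r][c] / s
        s = sum(Cn[l][m][n] for l in range(R1)
                for m in range(R2) for n in range(R3)) or 1.0
        for l in range(R1):
            for m in range(R2):
                for n in range(R3):
                    C[l][m][n] = Cn[l][m][n] / s
    return C, X, Y, Z
```

For a basis-constrained dimension, one would instead keep \(X_{B}\) fixed, replace \(X\) by \(X_{B}X\) in the reconstruction, and update only the free factor, as in the proof above.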
1.2 Proof for Corollary 1
A simplex space of dimension \(p-1\) shrunk by \(\epsilon \), where \(p \epsilon <1\), is defined as
For a given \(\zeta \in \mathbb R ^p\), and \(\max \{\zeta _k\}>0\), we define a projection
Lemma 2
The loss function is defined as
where \(a_{i}\ge 0, \sum _{i} a_{i} > p(1-\alpha )\), and \(w_{ik}>0\). For \(x \in \mathbb S ^{p-1}_{\epsilon }\), where \(\epsilon < 1/p\), if
then \(\xi \in \mathbb S ^{p-1}_{\epsilon }\) and \(f(\xi ) \le f(x)\) .
Proof
We introduce an auxiliary function,
We have \(g(x;x)=f(x)\) and \(g(z;x)\ge f(z)\) for any \(z\), because of convexity of \(-\ln (x)\) in the first term.
We have \(\max \{\zeta _{k}\} > 0\), because \(\sum _{k} \zeta _{k} = \sum _{i} a_{i} + p\alpha -p >0\). We have \(\lambda > 0\), because \(\lambda = \zeta _{k} /\xi _{k} +\gamma _{k} \ge \zeta _{k} /\xi _k \ge \max \{\zeta _{k}\} /\epsilon > 0\). Thus, \(\xi = \mathcal{P}^{p}_{\epsilon } \zeta \) minimizes \(g(z;x)\). Therefore, \(f(\xi ) \le g(\xi ;x) \le g(x;x)=f(x)\).
Although we do not have an explicit equation for \(\mathcal{P}^{p}_{\epsilon }\), inspired by [12], we can compute \(\mathcal{P}^{p}_{\epsilon } \zeta \) efficiently.
The Lagrangian for Eq. (18) is
where \(\gamma _{k} \ge 0\). With KKT condition, we have \( -\frac{1}{\xi _{k}} \zeta _{k} +\lambda - \gamma _{k} =0 , \sum _{k} \xi _{k} =1\), and \(\gamma _{k} =0\) if \(\xi _{k} >\epsilon \).
We prove that \(\xi _{k} \ge \xi _{l}\) if \(\zeta _{k} \ge \zeta _{l}\). Suppose that \(\xi _{l} > \xi _{k}\). If \(\xi _{l} > \xi _{k} > \epsilon \), it is contradicted by \(\zeta _{l} = \lambda \xi _{l} > \lambda \xi _{k} = \zeta _{k}\). If \(\xi _{l} > \xi _{k} = \epsilon \), it is contradicted by \(\zeta _{l} = \lambda \xi _{l} > \lambda \epsilon \ge (\lambda -\gamma _{k}) \xi _{k} = \zeta _{k}\).
We look for \(\omega \), such that \(\xi _{k} = \epsilon \) iff \(\zeta _{k} \le \omega \). Thus \(\xi _{k} = \zeta _{k} /\lambda > \epsilon \) if \(\zeta _{k} > \omega \). Let \(\mathbb A _{\omega } =\{k: \zeta _{k} \le \omega \}\), then
So \(\lambda = \frac{\sum _{k \not \in \mathbb A _{\omega }} \zeta _{k}}{1-|\mathbb A _{\omega }|\epsilon }\). Because \(\xi _{\omega } = \epsilon , \lambda \ge \omega / \xi _{\omega }\), thus
Since \(\max \{\zeta _{k}\} > 0\) and \(\epsilon < 1/p\), we can find a valid \(\omega \) such that \(\omega \ne \max \{\zeta _{k}\}\). We select the largest \(\omega \) that satisfies Eq. (19).
With the analysis above, we solve \(\mathcal{P}^{p}_{\epsilon }\) by Algorithm 7.1.
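Algorithm 7.1 is not reproduced in this excerpt; the following is a sketch of how \(\mathcal{P}^{p}_{\epsilon }\) can be computed from the analysis above, assuming the thresholded form \(\xi _{k} = \max (\epsilon , \zeta _{k}/\lambda )\) implied by the KKT conditions, with the clamped set \(\mathbb A _{\omega }\) found by scanning candidate sizes over the sorted entries (the function name is ours):

```python
def project_shrunk_simplex(zeta, eps):
    """Compute P^p_eps(zeta): minimize -sum_k zeta_k * ln(xi_k)
    subject to sum_k xi_k = 1 and xi_k >= eps, with p*eps < 1.
    KKT analysis gives xi_k = max(eps, zeta_k / lam) for a scalar lam."""
    p = len(zeta)
    assert p * eps < 1 and max(zeta) > 0
    sorted_z = sorted(zeta)  # ascending; smallest entries are clamped first
    # m = |A_omega| = number of clamped coordinates; prefer the largest
    # valid clamp set, matching the choice of the largest valid omega.
    for m in range(p - 1, -1, -1):
        lam = sum(sorted_z[m:]) / (1.0 - m * eps)
        ok_clamped = m == 0 or sorted_z[m - 1] <= lam * eps
        ok_free = sorted_z[m] > lam * eps
        if ok_clamped and ok_free:
            return [max(eps, z / lam) for z in zeta]
    raise ValueError("no valid threshold found")
```

For example, with \(\zeta = (0.9, 0.05, 0.05)\) and \(\epsilon = 0.1\), the two small coordinates are clamped to \(\epsilon\) and the remaining mass goes to the first coordinate.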
Proof of Corollary 1
Each of \(\mathcal C, X, Y\), and \(Z\) can be updated using Lemma 2 after a suitable reshaping. Thus, we can sequentially minimize the loss function of Eq. (5).\(\square \)
Chi, Y., Zhu, S. FacetCube: a general framework for non-negative tensor factorization. Knowl Inf Syst 37, 155–179 (2013). https://doi.org/10.1007/s10115-012-0566-x