1 Introduction

Recommender systems have proven to be a valuable component in many applications of personalization and the Internet economy. Traditional recommender systems estimate a score function mapping each user–item pair to a scalar value, using the items the user has previously rated or interacted with [1]. Recent methods have successfully integrated side information such as item content, user context, social networks, and item topics. For this purpose a variety of features can be taken into consideration, such as routine, geolocation, spatial correlation of certain preferences, mood and sentiment, as well as social relationships such as “friendship” with other users or “belonging” to a community in a social network [2]. In particular, a rich line of research has explored the integration of topic models and collaborative filtering using principled probabilistic models [3,4,5]. Another group of models integrates social network information into recommender systems, either by adding extra dependencies between user–item ratings [6] or by directly constraining and regularizing the user latent factors with social features [7, 8]. Finally, some models have focused on jointly learning social and content features, constructing hybrid recommender systems [5, 9, 10].

Our contribution is situated at the intersection of these three groups of efforts: we propose a probabilistic model that generalizes previous models by jointly modeling content and social factors in the preference model, applying Poisson–Gamma latent variable models to capture the non-negativity of the user–item ratings and to induce sparse non-negative latent representations. Using this joint model we can generate recommendations based on the estimated score of non-observed items. In this article, we formulate the problem (Sect. 1.1), describe the proposed model (Sect. 3), present the variational inference algorithm (Sect. 4) and discuss the empirical results (Sect. 5). Our results indicate improved performance compared to state-of-the-art methods, including Correlated Topic Regression with Social Matrix Factorization (CTR-SMF) [5].

1.1 Problem formulation

Given a set of observations of user–item interactions \(R_{\text {train}}=\{(u,d,R_{ud})\}\), with \(|R_{\text {train}}|=N_{\text {obs}}\ll U \times D \) (U is the number of users and D the number of documents), together with additional item content information and the user social network, we aim to learn a function f that estimates the value of the user–item interaction for every pair of user and item, \(R_{\text {complete}}=\{(u,d,f(u,d)) \}\). In general, to solve this problem we assume that users have a set of preferences, and (using matrix factorization) we model these preferences with latent vectors.

Therefore, we have the document (or item) set \(\mathcal {D}\) of size \(|\mathcal {D}|=D\), the vocabulary set \(\mathcal {V}\) of size \(|\mathcal {V}|=V\), the user set \(\mathcal {U}\) of size \(|\mathcal {U}|=U\), and the social network given by the set of neighbors of each user \(\{N(u)\}_{u \in \mathcal {U}}\). Given the partially observed user–item matrix with integer ratings or implicit counts \(\varvec{R}=(R_{ud}) \in \mathbb {N}^{U \times D}\), the observed document–word count matrix \(\varvec{W}=(W_{dv}) \in \mathbb {N}^{D \times V} \), and the user social network \(\{N(u)\}_{u \in \mathcal {U}}\), we need to estimate a matrix \(\varvec{\widetilde{R}} \in \mathbb {N}^{U \times D}\) that completes the user–item matrix \(\varvec{R}\). Finally, with the estimated matrix we can rank the unseen items for each user and make recommendations.
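As an illustration of the observed data, the counts can be held in sparse matrices plus an adjacency structure; the following minimal sketch uses names of our own choosing (they are not part of the model definition):

```python
import numpy as np
from scipy.sparse import csr_matrix

U, D, V = 1000, 500, 2000            # users, documents, vocabulary size

# Partially observed user-item counts R (implicit feedback), stored sparsely.
rows = np.array([0, 0, 3])           # user indices of observed interactions
cols = np.array([10, 42, 7])         # item indices
vals = np.array([1, 1, 1])           # counts (here binarized implicit feedback)
R = csr_matrix((vals, (rows, cols)), shape=(U, D))

# Document-word counts W (e.g. bag-of-words or aggregated tags).
W = csr_matrix((D, V), dtype=np.int64)

# Social network as neighbor lists N(u).
N = {0: [5, 17], 3: [0]}             # user 0 trusts users 5 and 17, etc.
```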

2 Related Work

Collaborative Topic Regression (CTR): CTR [3] is a probabilistic model combining topic modeling (using Latent Dirichlet Allocation) and probabilistic matrix factorization (using a Gaussian likelihood). Collaborative Topic Regression with Social Matrix Factorization (CTR-SMF) [5] builds upon CTR by adding social matrix factorization, creating a joint Gaussian factorization model with content and social side information. Limited Attention Collaborative Topic Regression (LA-CTR) [9] is another approach in which the authors propose a joint model based on CTR that integrates a behavioral mechanism of attention. In this case, the amount of attention the user invests in the social network is limited, and a measure of influence implies that the user may favor some friends more than others. In [10], the authors propose a CTR model that seamlessly integrates item tags, item content, and social network information. All the models mentioned above combine, to some degree, LDA with Gaussian-based matrix factorization for recommendation. The time complexity of training these models is therefore dominated by the LDA component, making them difficult to scale. Moreover, the combination of LDA and Gaussian matrix factorization in CTR yields a non-conjugate model that is hard to fit and handles sparse data poorly.

Poisson Factorization: Basic Poisson factorization is a probabilistic model for non-negative matrix factorization based on the assumption that each user–item interaction \(R_{ui}\) can be modelled as the inner product of a K-dimensional user latent vector \(\varvec{U}_{u}\) and item latent vector \(\varvec{V}_{i}\), representing the unobserved user preferences and item attributes [11], so that \(R_{ui} \sim \text {Poisson}(\varvec{U_{u}}^T\varvec{V_{i}})\). Poisson factorization models for recommender systems have the advantage of principled modeling of implicit feedback, sparse latent representations, fast approximate inference with sparse matrices (the likelihood depends only on the consumed items), and improved empirical results compared with Gaussian-based models [11, 12]. The Bayesian nonparametric Poisson factorization model (BNPPF) [12] extends basic Poisson factorization by drawing user weights from a Gamma process; the latent dimensionality in this model is estimated from the data, effectively avoiding the ad hoc choice of the latent space dimensionality K. Social Poisson factorization (SPF) [6] extends basic Poisson factorization to accommodate preference- and social-based recommendations, adding a degree-of-trust variable and making each user–item interaction conditionally dependent on the user's friends. Collaborative topic Poisson factorization (CTPF) [4] uses shared latent factors to fuse recommendation with topic modeling, with a Poisson likelihood and Gamma variables for both.
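To make this basic building block concrete, the following minimal numpy sketch simulates the generative assumption \(R_{ui} \sim \text {Poisson}(\varvec{U_{u}}^T\varvec{V_{i}})\) with Gamma priors on the factors (the sizes and hyperparameter values are illustrative choices of ours, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, K = 50, 80, 10

# Gamma priors induce non-negative, sparse-ish latent factors.
user_pref = rng.gamma(shape=0.3, scale=1.0, size=(num_users, K))   # U_u
item_attr = rng.gamma(shape=0.3, scale=1.0, size=(num_items, K))   # V_i

# The Poisson rate is the inner product of user and item factors.
rates = user_pref @ item_attr.T              # shape (num_users, num_items)
R = rng.poisson(rates)                       # simulated implicit counts
```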

Non-negative matrix and tensor factorization using Poisson models: Poisson models have also been used successfully in more general settings such as tensor factorization and relational learning, particularly where count data and non-negative factors arise. In [13], the authors propose a generic Bayesian non-negative tensor factorization model for count and binary data. In [14], the authors explore the idea of adding constraints between the model variables using side information with hierarchical structure, while the approach in [15] models graph side information jointly with topic modeling using a Gamma process – a joint non-parametric model of network and documents.

3 Poisson Matrix Factorization with Content and Social Trust Information (PoissonMF-CS)

The proposed model PoissonMF-CS (see Fig. 1) is an extension and generalization of previous Poisson factorization models, combining the social factorization model (social Poisson factorization, SPF) [6] and the topic-based factorization model (collaborative topic Poisson factorization, CTPF) [4].

The main idea is to employ shared latent Gamma factors for topical preferences and trust weight variables in the user social network, combining all factors in the rate of a Poisson likelihood for the user–item interaction. We model both sources of information as having an additive effect on the observed user–item interactions and add a global multiplicative weight for each group of latent factors. The intuition behind the additive effect of social trust is that users tend to interact with items presented by their peers, so we can imagine a mechanism of “peer pressure” operating, where items offered through the social network have a positive (or neutral) influence on the user. In other words, we believe there is a positive social bias rather than an anti-social bias, and we factor this into the PoissonMF-CS model.

In Poisson models, the non-negativity constraint on the latent factors results in sparse latent factors and can help avoid over-fitting (in comparison with Gaussian-based models [11, 12]). The Gamma priors on the latent factors, together with the fact that each latent factor can only have a positive or zero effect on the final prediction, induce sparse latent representations: during inference, a factor that decreases the model likelihood is driven towards zero.

Fig. 1. Plate diagram for the PoissonMF-CS model

3.1 Generative Model

In this model, \(W_{dv}\) is a count variable for the number of times word v appears in document d, \(\varvec{\beta }_{v}\) is a latent vector capturing the topic distribution of word v, and \(\varvec{\theta }_{d}\) is the document–topic intensity vector, both of dimensionality K. The count variable \(W_{dv}\) is parametrized by the inner product of these two latent factors, \(\varvec{\theta _{d}}^T\varvec{\beta _{v}}\). The document–topic latent factor \(\varvec{\theta }_{d}\) also influences the user–document rating variable \(R_{ud}\). Each user has a latent vector \(\varvec{\eta }_{u}\) representing the user–topic propensity, which interacts with the document–topic intensity factor \(\varvec{\theta }_{d}\) and the document–topic offset factor \(\varvec{\epsilon }_{d}\), resulting in the term \(\varvec{\eta }_{u}^T \varvec{\theta }_{d}+\varvec{\eta }_{u}^T\varvec{\epsilon }_{d}\). Here, \(\varvec{\eta }_{u}^T\varvec{\epsilon }_{d}\) captures the baseline matrix factorization, while \(\varvec{\eta }_{u}^T \varvec{\theta }_{d}\) connects the rating variable with the content-based part of the model (the word–document variable \(W_{dv}\)). The trust factor \(\tau _{ui}\) from user u to user i is zero for all users that are not connected in the social network (\(\tau _{ui}>0 \Leftrightarrow i \in N(u)\)). This trust factor adds dependencies between socially connected users: the user–document rating \(R_{ud}\) is influenced by the ratings given to item d by the friends of user u, weighted by the trust user u assigns to each friend (\(\sum _{i\in N(u)} \tau _{ui}R_{id} \)). We model this social dependency using a conditionally specified model, as in [6]. The latent variables \(\lambda _C\) and \(\lambda _S\) are weight variables added to the model to capture and control the overall weight of the content and social factors. These variables allow us to infer the importance of content and social factors according to the dataset or domain of usage. Alternatively, instead of estimating these weights from the observed data, we may fix \(\lambda _C\) and \(\lambda _S\) to constant values, thus controlling the importance of the content and social parts of the model. Specifically, setting \(\lambda _C=0\) and \(\lambda _S=1\) recovers the SPF model, setting \(\lambda _C=1\) and \(\lambda _S=0\) recovers CTPF, and setting \(\lambda _C=0\) and \(\lambda _S=0\) is equivalent to simple Poisson matrix factorization without any side information [11].

We now present the complete generative model, assuming the document (or item) set \(\mathcal {D}\) of size \(|\mathcal {D}|=D\), the vocabulary set \(\mathcal {V}\) of size \(|\mathcal {V}|=V\), the user set \(\mathcal {U}\) of size \(|\mathcal {U}|=U\), the user social network given by the set of neighbors of each user \(\{N(u)\}_{u \in \mathcal {U}}\), and K latent factors (topics) with index set \(\mathcal {K}\).

  1. Latent parameter distributions:

     (a) for all topics \(k \in \mathcal {K}\):

       • for all words \(v \in \mathcal {V}\): \(\beta _{vk} \sim \text {Gamma}(a^0_{\beta },b^0_{\beta })\)

       • for all documents \(d \in \mathcal {D}\): \(\theta _{dk} \sim \text {Gamma}(a^0_{\theta },b^0_{\theta }) \) and \(\epsilon _{dk} \sim \text {Gamma}(a^0_{\epsilon },b^0_{\epsilon })\)

       • for all users \(u \in \mathcal {U}\): \(\eta _{uk} \sim \text {Gamma}(a^0_{\eta },b^0_{\eta })\)

       • for all users \(u \in \mathcal {U}\) and neighbors \(i \in N(u)\): \(\tau _{ui} \sim \text {Gamma}(a^0_{\tau },b^0_{\tau })\)

     (b) Content weight: \(\lambda _C \sim \text {Gamma}(a^0_{C},b^0_{C})\)

     (c) Social weight: \(\lambda _S \sim \text {Gamma}(a^0_{S},b^0_{S})\)

  2. Observation probability distributions:

     (a) for all observed document–word pairs dv:

       $$ W_{dv} | \varvec{\beta }_{v},\varvec{\theta }_{d} \sim \text {Poisson}(\varvec{\beta _{v}}^T\varvec{\theta _{d}}) $$

     (b) for all observed user–document pairs ud:

       $$R_{ud}|\varvec{R_{N(u),d},\eta _{u},\epsilon _{d},\theta _{d}} \sim \text {Poisson}( \lambda _C \varvec{\eta }_{u}^T\varvec{\theta }_{d}+\varvec{\eta }_{u}^T\varvec{\epsilon }_{d}+\lambda _S \sum _{i\in N(u)} \tau _{ui}R_{id} )$$
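As an illustration of the generative process above, the sketch below simulates a small PoissonMF-CS dataset with numpy. The sizes, hyperparameter values, and the random social network are our own choices for the example, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
U, D, V, K = 30, 40, 100, 5
a0, b0 = 0.3, 1.0                              # shared Gamma shape/rate (illustrative only)

# Latent factors (step 1 of the generative model)
beta  = rng.gamma(a0, 1 / b0, size=(V, K))     # word-topic factors beta_vk
theta = rng.gamma(a0, 1 / b0, size=(D, K))     # document-topic intensities theta_dk
eps   = rng.gamma(a0, 1 / b0, size=(D, K))     # document-topic offsets epsilon_dk
eta   = rng.gamma(a0, 1 / b0, size=(U, K))     # user-topic propensities eta_uk
lam_C = rng.gamma(a0, 1 / b0)                  # content weight lambda_C
lam_S = rng.gamma(a0, 1 / b0)                  # social weight lambda_S

# Random social network N(u) and trust factors tau_ui (defined only for neighbors)
N = {u: [i for i in rng.choice(U, size=3, replace=False) if i != u] for u in range(U)}
tau = {(u, i): rng.gamma(a0, 1 / b0) for u in N for i in N[u]}

# Observations (step 2): document-word counts
W = rng.poisson(theta @ beta.T)                # W_dv ~ Poisson(theta_d^T beta_v)

# The rating model is conditionally specified [6]; as a rough simulation we first
# draw ratings without the social term, then redraw with friends' ratings plugged in.
base_rate = lam_C * (eta @ theta.T) + eta @ eps.T          # shape (U, D)
R = rng.poisson(base_rate)
for u in range(U):
    social = sum(tau[(u, i)] * R[i] for i in N[u])          # trust-weighted friends' ratings
    R[u] = rng.poisson(base_rate[u] + lam_S * social)
```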

4 Inference

First, we add a set of auxiliary latent Poisson variables to facilitate posterior inference. With these variables the extended model becomes fully conjugate and consequently has analytical expressions for the complete conditionals and variational updates [16]. In Appendix A we show that these auxiliary variables can be seen as a by-product of a lower bound on the expected value of the log-sum of the latent random variables. The variable \(Y_{dv,k}\) represents a topic-specific latent count for a word–document pair, so that the observed word–document count is the sum of the latent counts (a property of the Poisson distribution)Footnote 1. We perform a similar modification for the user–item counts, splitting the latent terms of the \(R_{ud}\) rate into two groups of topic-specific latent count allocation variables, \(Z_{ud,k}^M\) for the item content part and \(Z_{ud,k}^N\) for the collaborative filtering part, plus a group of friend-specific variables \(Z_{ud,i}^S\) for the social trust part (here the intuitive interpretation of the latent dimension is a friend-specific allocation of trust). The sum of all these latent counts is the observed user–item interaction count variable \(R_{ud}\).

$$\begin{aligned} \begin{array}{rl} Y_{dv,k}|\beta _{vk},\theta _{dk} &{}\sim \text {Poisson}(\beta _{vk}\theta _{dk}) \\ Z_{ud,k}^M|\lambda _C,\eta _{uk},\theta _{dk} &{}\sim \text {Poisson}(\lambda _C\eta _{uk}\theta _{dk}) \\ Z_{ud,k}^N|\eta _{uk},\epsilon _{dk} &{}\sim \text {Poisson}(\eta _{uk}\epsilon _{dk}) \\ Z_{ud,i}^S|\lambda _S,\tau _{ui},R_{id} &{}\sim \text {Poisson}(\lambda _S\tau _{ui}R_{id}) \end{array} \end{aligned}$$
(1)
$$ \text {with }\sum _k Y_{dv,k} = W_{dv} \text {, and }\sum _k (Z_{ud,k}^M+Z_{ud,k}^N)+\sum _{i \in N(u) }Z_{ud,i}^S=R_{ud}.$$
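This decomposition relies on the superposition property of the Poisson distribution: independent Poisson counts with rates \(\lambda_k\) sum to a Poisson count with rate \(\sum_k \lambda_k\). A quick numerical check (a sketch of ours, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(2)
rates = np.array([0.5, 1.2, 0.3])                  # per-topic rates lambda_k
latent = rng.poisson(rates, size=(100000, 3))      # independent latent counts Y_k
total = latent.sum(axis=1)                         # observed count W = sum_k Y_k

# Empirical mean and variance of the sum should both be close to sum(rates) = 2.0
print(total.mean(), total.var(), rates.sum())
```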

The inference problem consists of estimating the posterior distribution of the latent variables given the observed ratings \(\varvec{R}\), the observed document–word counts \(\varvec{W}\), and the user social network \(\{N(u)\}_{u \in \mathcal {U}}\); in other words, computing

$$p(\varTheta |\varvec{R},\varvec{W},\{N(u)\}_{u \in \mathcal {U}}),$$

where \( \varTheta =\{\varvec{\beta ,\theta ,\eta ,\epsilon ,\tau ,y,z},\lambda _C,\lambda _S \} \) is the set of all latent variables. The exact computation of this posterior is intractable for any practical scenario, so we need approximation techniques for efficient parameter learning. In our case, we apply variational techniques to derive the learning algorithm. As an intermediate step towards the variational inference algorithm, we also derive the full conditional distribution of each latent variable. These full conditionals could equally serve as update equations for Gibbs sampling, meaning that the resulting equations could be used to implement a sampling-based approximation. However, sampling methods are hard to scale and usually require more memory, so as a design choice for the implementation of the learning algorithm we refrained from Gibbs sampling and focused on variational inference.

In the next sections, we present the full conditional distribution of each latent variable (Sect. 4.1) and the resulting update equations for the variational parameters (Sect. 4.2).

4.1 Full Conditional Distribution

The full conditional distribution of a latent variable is its distribution given all the other variables in the model. Given a set of indexed random variables \(X_k\), we use the notation \(p(X_k|X_{-k})\) (where \(X_{-k}\) denotes all the variables \(X_i\) with \(i \ne k\)) to represent the full conditional distribution. Given the factorized structure of the model, we can reduce the conditioning set to the Markov blanket of the node under consideration (its children and co-parent nodes)Footnote 2 [16]. For conciseness, we show the derivations only for one Gamma latent variable and one Poisson latent count variable.

  • Gamma distributed variables: We demonstrate how to obtain the full conditional distribution of the Gamma distributed variable \(\theta _{dk}\); for the remaining Gamma distributed variables we only present the end result, without the intermediate steps.

    $$\begin{aligned} \begin{array}{ll} p(\theta _{dk}| *) &{}= p(\theta _{dk}| \text {MarkovBlanket}(\theta _{dk})) \\ &{} \propto p(\theta _{dk}) \prod _{v=1}^V p(Y_{dv,k}|\beta _{vk},\theta _{dk}) \prod _{u=1}^U p(Z_{ud,k}^M|\lambda _C,\eta _{uk},\theta _{dk}) \\ &{} \propto \theta _{dk}^{a^0_{\theta }-1}e^{-b^0_{\theta }\theta _{dk}} \prod _v \theta _{dk}^{Y_{dv,k}} e^{-\beta _{vk}\theta _{dk}} \prod _u \theta _{dk}^{Z_{ud,k}^M} e^{-\lambda _C\eta _{uk}\theta _{dk}} \\ &{} \propto \theta _{dk}^{a^0_{\theta }+\sum _v Y_{dv,k}+\sum _u Z_{ud,k}^M-1} e^{-\theta _{dk}(b^0_{\theta }+\sum _v \beta _{vk}+\lambda _C\sum _u \eta _{uk})} \end{array} \end{aligned}$$
    (2)

    Normalizing Eq. 2 over \(\theta _{dk}\), we obtain the pdf of a Gamma variable with shape \(a^0_{\theta }+\sum _v Y_{dv,k}+\sum _u Z_{ud,k}^M\) and rate \(b^0_{\theta }+\sum _v \beta _{vk}+\lambda _C\sum _u \eta _{uk}\). The final result is given in Eq. 3. Notice that, because of the way the model is structured, all other Gamma latent variables have similar expressions, the difference being the set of variables in their Markov blankets.

    $$\begin{aligned} \begin{array}{ll} \theta _{dk}| * &{}\sim \text {Gamma}(a^0_{\theta }+\sum _v Y_{dv,k}+\sum _u Z_{ud,k}^M,\,b^0_{\theta }+\sum _v \beta _{vk}+\lambda _C\sum _u \eta _{uk}) \\ \beta _{vk}|* &{} \sim \text {Gamma}(a^0_{\beta }+\sum _d Y_{dv,k},\,b^0_{\beta }+\sum _d \theta _{dk}) \\ \eta _{uk}|* &{} \sim \text {Gamma}(a^0_{\eta }+\sum _d (Z_{ud,k}^M+Z_{ud,k}^N),\,b^0_{\eta }+\lambda _C\sum _d \theta _{dk}+\sum _d \epsilon _{dk}) \\ \epsilon _{dk}|* &{} \sim \text {Gamma}(a^0_{\epsilon }+\sum _u Z_{ud,k}^N,\,b^0_{\epsilon }+\sum _u \eta _{uk})\\ \tau _{ui}|* &{} \sim \text {Gamma}(a^0_{\tau }+\sum _d Z_{ud,i}^S,\,b^0_{\tau }+\lambda _S\sum _d R_{id}) \\ \lambda _C|* &{} \sim \text {Gamma}(a^0_{C}+\sum _{u,d,k} Z_{ud,k}^M,\,b^0_{C}+\sum _{u,d,k} \eta _{uk}\theta _{dk}) \\ \lambda _S|* &{} \sim \text {Gamma}(a^0_{S}+\sum _{u,d,i} Z_{ud,i}^S,\,b^0_{S}+\sum _{u,d,i} \tau _{ui}R_{id}) \end{array} \end{aligned}$$
    (3)
  • Multinomial distributed (auxiliary) variables: looking at the Markov blanket of \(\varvec{Y}_{dv}\) we obtain:

    $$\begin{aligned} \begin{array}{rl} p(\varvec{Y}_{dv}| * ) &{} \propto \prod _{k=1}^K p(Y_{dv,k}|\beta _{vk},\theta _{dk}) = \prod _{k=1}^K \text {Poisson}(Y_{dv,k}|\beta _{vk}\theta _{dk}) \\ &{} \propto \prod _{k=1}^K \frac{(\beta _{vk}\theta _{dk})^{Y_{dv,k}}}{Y_{dv,k}!} \end{array} \end{aligned}$$
    (4)

    Given that \(\sum _k Y_{dv,k} = W_{dv}\), this functional form is equivalent to the pdf of a Multinomial distribution with probabilities proportional to \(\beta _{vk}\theta _{dk}\).

    $$\begin{aligned} \varvec{Y}_{dv}|*&\sim \text {Mult}(W_{dv};\varvec{\phi _{dv}})&\text { with }\phi _{dv,k} = \frac{\beta _{vk}\theta _{dk}}{\sum _k \beta _{vk}\theta _{dk}} \end{aligned}$$
    (5)

    Similarly, \(\varvec{Z}_{ud}\) is Multinomial with parameters proportional to the rates of the parent nodes of \(\varvec{Z}_{ud}\). For convenience, in the previous section we split \(\varvec{Z}_{ud}\) into three blocks of variables and parameters, \(\varvec{Z}_{ud}=[\varvec{Z}_{ud}^M,\varvec{Z}_{ud}^N,\varvec{Z}_{ud}^S]\), representing the different high-level parts of our model. The dimensionality of the first two blocks is K, while that of the last block is U, so \(\varvec{Z}_{ud}\) has dimensionality \(2K+U\). Similarly, the parameters of the \(\varvec{Z}_{ud}\) full conditional Multinomial have a block structure \(\varvec{\xi }_{ud}=[\varvec{\xi }_{ud}^M,\varvec{\xi }_{ud}^N,\varvec{\xi }_{ud}^S]\).

    $$\begin{aligned} Z_{ud}|*&\sim \text {Mult}(R_{ud};\varvec{\xi _{ud}}) \end{aligned}$$
    $$ \text {with }\xi _{ud,k} = \left\{ \begin{matrix} \xi _{ud,k}^M=\frac{\lambda _C \eta _{uk}\theta _{dk}}{\sum _k \eta _{uk}(\lambda _C\theta _{dk}+\epsilon _{dk})+\lambda _S\sum _{i\in N(u)}\tau _{ui}R_{id}} \\ \xi _{ud,k}^N=\frac{\eta _{uk}\epsilon _{dk}}{\sum _k \eta _{uk}(\lambda _C\theta _{dk}+\epsilon _{dk})+\lambda _S\sum _{i\in N(u)}\tau _{ui}R_{id}} \\ \xi _{ud,i}^S=\frac{\lambda _S \tau _{ui}R_{id}}{\sum _k \eta _{uk}(\lambda _C\theta _{dk}+\epsilon _{dk})+\lambda _S\sum _{i\in N(u)}\tau _{ui}R_{id}} \end{matrix}\right. $$

In the next section we show how to use these equations to derive a deterministic optimization algorithm for approximate inference using the variational method.

4.2 Variational Inference

Given a family of surrogate distributions \(q(\varTheta | \varPsi )\) over the unobserved variables (latent terms), parametrized by variational parameters \(\varPsi \), we want to find the assignment of the variational parameters that minimizes the KL-divergence between \(q(\varTheta | \varPsi )\) and \(p(\varTheta |\varvec{R},\varvec{W})\) Footnote 3,

$$ \mathop {\text {arg min}}\limits _{\varPsi } \text {KL}\{q(\varTheta | \varPsi ), p(\varTheta |\varvec{R},\varvec{W}) \}. $$

Then, the optimal surrogate distribution can be used as an approximation to the true posterior. However, optimizing the KL divergence directly is not tractable, since it depends on the computation of the evidence \(\log p(\varvec{R},\varvec{W})\). Instead, we use Jensen's inequality to obtain a lower bound on the evidence and change the optimization objective to this lower bound – the Evidence Lower BOund (ELBO):

$$\begin{aligned} \mathop {\text {arg max}}\limits _{\varPsi } L(\varPsi )&= \text {E}_q[\log p(\varvec{R},\varvec{W},\varTheta ) - \log q(\varTheta | \varPsi )] \end{aligned}$$

Another ingredient in this approximation is the mean field assumption: all variables in the variational distribution \(q(\varTheta | \varPsi )\) are assumed to be mutually independent. As a result, the variational surrogate distribution can be expressed as a factorized distribution over the latent factors (Eq. 7). Another implication is that we can compute the update for each variational factor \(X_i\) using the complete conditional of that latent factor [16]. Finally, the inference algorithm consists in iteratively updating the variational parameters of each factorized distribution until convergence, resulting in the coordinate ascent variational inference algorithm based on the following equation:

$$\begin{aligned} q(X_i) \propto \exp \{ \text {E}_{q}[\log p(X_i|\text {*})] \} \end{aligned}$$
(6)

Using Eq. 6, we take each complete conditional described in the previous section and create the corresponding proposal distribution for the variational inference. Each proposal distribution is in the same family as the full conditional distribution of its latent variable, meaning that we have a group of Gamma and Multinomial variables. As long as we update the parameters of the variational distribution using Eq. 6, the procedure is guaranteed not to increase the KL divergence between the surrogate variational distribution (Eq. 7) over the latent variables and the posterior distribution of the model.

$$\begin{aligned} \begin{array}{rl} q(\varTheta | \varPsi ) &{}= q(\lambda _C|a_{\lambda _C},b_{\lambda _C}) q(\lambda _S|a_{\lambda _S},b_{\lambda _S})\prod _{u,k,i} q(\tau _{ui}|a_{\tau _{ui}},b_{\tau _{ui}}) q(\eta _{uk}|a_{\eta _{uk}},b_{\eta _{uk}}) \\ &{} \times \prod _{d,v,k} q(\epsilon _{dk}|a_{\epsilon _{dk}},b_{\epsilon _{dk}})q(\theta _{dk}|a_{\theta _{dk}},b_{\theta _{dk}}) q(\beta _{vk}|a_{\beta _{vk}},b_{\beta _{vk}}) \\ &{} \times \prod _{d,v,u} q(\varvec{Y}_{dv}|\varvec{\phi }_{dv}^*) q(\varvec{Z}_{ud}|\varvec{\xi }_{ud}^{M*},\varvec{\xi }_{ud}^{N*},\varvec{\xi }_{ud}^{S*}) \end{array} \end{aligned}$$
(7)

After applying Eq. 6 together with the expected value properties for each latent variableFootnote 4, we obtain the following update equations for the variational parameters.

  • Content and social weights: \( \begin{array}{rlrl} a_{\lambda _C} &{} = a^0_{C} + \sum _{u,d,k} R_{ud}\xi _{ud,k}^{M*}, &{} b_{\lambda _C} &{} = b^0_{C} + \sum _{u,d,k} \frac{a_{\eta _{uk}}}{b_{\eta _{uk}}}\frac{a_{\theta _{dk}}}{b_{\theta _{dk}}} \nonumber \\ a_{\lambda _S} &{} = a^0_{S} + \sum _{u,d,i} R_{ud}\xi _{ud,i}^{S*}, &{} b_{\lambda _S} &{} = b^0_{S} + \sum _{u,d,i} R_{id}\frac{a_{\tau _{ui}}}{b_{\tau _{ui}}} \\ \end{array} \)

  • Content v (topic/tags/etc) parameters: \( \begin{array}{rlrl} a_{\beta _{vk}}&= a^0_{\beta } + \sum _d W_{dv}\phi _{dv,k}^*,&b_{\beta _{vk}}&= b^0_{\beta } + \sum _d \frac{a_{\theta _{dk}}}{b_{\theta _{dk}}} \nonumber \end{array} \)

  • Item d parameters: \( \begin{array}{rlrl} a_{\epsilon _{dk}} &{} = a^0_{\epsilon } + \sum _u R_{ud}\xi _{ud,k}^{N*}, &{} b_{\epsilon _{dk}} &{} = b^0_{\epsilon } + \sum _u \frac{a_{\eta _{uk}}}{b_{\eta _{uk}}} \nonumber \\ a_{\theta _{dk}} &{} = a^0_{\theta } + \sum _u R_{ud}\xi _{ud,k}^{M*}+\sum _v W_{dv}\phi _{dv,k}^*, &{} b_{\theta _{dk}} &{} = b^0_{\theta } + \text {E}_q[\lambda _C]\sum _u \frac{a_{\eta _{uk}}}{b_{\eta _{uk}}}+\sum _v \frac{a_{\beta _{vk}}}{b_{\beta _{vk}}} \nonumber \\ &{} &{} &{} \nonumber \end{array} \)

  • User u parameters: \( \begin{array}{rlrl} a_{\eta _{uk}} &{} = a^0_{\eta } + \sum _d R_{ud}(\xi _{ud,k}^{M*}+\xi _{ud,k}^{N*}), &{} b_{\eta _{uk}} &{} = b^0_{\eta } + \sum _d \text {E}_q[\lambda _C]\frac{a_{\theta _{dk}}}{b_{\theta _{dk}}}+ \frac{a_{\epsilon _{dk}}}{b_{\epsilon _{dk}}} \nonumber \\ a_{\tau _{ui}} &{} = a^0_{\tau } + \sum _{d} R_{ud}\xi _{ud,i}^{S*}, &{} b_{\tau _{ui}} &{} = b^0_{\tau } + \text {E}_q[\lambda _S]\sum _{d} R_{id} \nonumber \end{array} \)

  • item–content dv parameters: \( \begin{array}{rlrl} \phi _{dv,k}^*&\propto \frac{e^{\varPsi \left( a_{\beta _{vk}}\right) }}{b_{\beta _{vk}}}\frac{e^{\varPsi \left( a_{\theta _{dk}}\right) }}{b_{\theta _{dk}}}&\text {with }\sum _k \phi _{dv,k}&= 1 \nonumber \end{array}\)

  • user–item ud parameters: \( \begin{array}{rlrl} \xi _{ud,k}^{M*} &{} \propto e^{E_q[\log \lambda _C]}\frac{e^{\varPsi \left( a_{\eta _{uk}}\right) }}{b_{\eta _{uk}}} \frac{e^{\varPsi \left( a_{\theta _{dk}}\right) }}{b_{\theta _{dk}}} &{} \xi _{ud,k}^{N*} &{} \propto \frac{e^{\varPsi \left( a_{\eta _{uk}}\right) }}{b_{\eta _{uk}}} \frac{e^{\varPsi \left( a_{\epsilon _{dk}}\right) }}{b_{\epsilon _{dk}}} \nonumber \\ \xi _{ud,i}^{S*} &{} \propto e^{E_q[\log \lambda _S]} \frac{e^{\varPsi \left( a_{\tau _{ui}}\right) }}{b_{\tau _{ui}}}R_{id} &{} \text {with }&{}\sum _k \xi _{ud,k}^{M*}+\xi _{ud,k}^{N*}+\sum _i \xi _{ud,i}^{S*} = 1\nonumber \end{array}\)
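To make the update equations above concrete, here is one possible dense-numpy sketch of the coordinate ascent loop. It is our own illustration (the function name, initialization scheme, and shared hyperparameters are assumptions), and a practical implementation would exploit the sparsity of R and W instead of materializing full (U, D, K) and (U, D, U) arrays.

```python
import numpy as np
from scipy.special import digamma

def fit_poissonmf_cs(R, W, A, K=10, n_iter=50, a0=0.3, b0=0.3, seed=0):
    """Dense CAVI sketch for PoissonMF-CS (illustrative, not the authors' code).
    R: (U, D) user-item counts, W: (D, V) document-word counts,
    A: (U, U) binary adjacency with A[u, i] = 1 iff i is in N(u)."""
    rng = np.random.default_rng(seed)
    U, D = R.shape
    R, W, A = R.astype(float), W.astype(float), A.astype(float)

    def init(*shape):                            # variational Gamma(a, b) parameters
        return a0 + 0.01 * rng.random(shape), b0 + 0.01 * rng.random(shape)

    a_beta, b_beta = init(W.shape[1], K)
    a_theta, b_theta = init(D, K)
    a_eps, b_eps = init(D, K)
    a_eta, b_eta = init(U, K)
    a_tau, b_tau = init(U, U)
    a_lC, b_lC = a0, b0
    a_lS, b_lS = a0, b0

    E = lambda a, b: a / b                       # E_q[x] for x ~ Gamma(a, b)
    G = lambda a, b: np.exp(digamma(a)) / b      # exp(E_q[log x])

    for _ in range(n_iter):
        # Multinomial parameters phi (word-document) and xi (user-item), Sect. 4.1
        phi = G(a_theta, b_theta)[:, None, :] * G(a_beta, b_beta)[None, :, :]   # (D, V, K)
        phi /= phi.sum(axis=2, keepdims=True)

        xi_M = G(a_lC, b_lC) * G(a_eta, b_eta)[:, None, :] * G(a_theta, b_theta)[None, :, :]
        xi_N = G(a_eta, b_eta)[:, None, :] * G(a_eps, b_eps)[None, :, :]            # (U, D, K)
        xi_S = G(a_lS, b_lS) * (A * G(a_tau, b_tau))[:, None, :] * R.T[None, :, :]  # (U, D, U)
        denom = xi_M.sum(2) + xi_N.sum(2) + xi_S.sum(2)                             # (U, D)
        xi_M, xi_N, xi_S = (x / denom[:, :, None] for x in (xi_M, xi_N, xi_S))

        # Gamma (shape, rate) updates following the equations listed above
        a_beta = a0 + np.einsum('dv,dvk->vk', W, phi)
        b_beta = b0 + E(a_theta, b_theta).sum(0)
        a_theta = a0 + np.einsum('ud,udk->dk', R, xi_M) + np.einsum('dv,dvk->dk', W, phi)
        b_theta = b0 + E(a_lC, b_lC) * E(a_eta, b_eta).sum(0) + E(a_beta, b_beta).sum(0)
        a_eps = a0 + np.einsum('ud,udk->dk', R, xi_N)
        b_eps = b0 + E(a_eta, b_eta).sum(0)
        a_eta = a0 + np.einsum('ud,udk->uk', R, xi_M + xi_N)
        b_eta = b0 + E(a_lC, b_lC) * E(a_theta, b_theta).sum(0) + E(a_eps, b_eps).sum(0)
        a_tau = a0 + np.einsum('ud,udi->ui', R, xi_S)
        b_tau = b0 + E(a_lS, b_lS) * R.sum(axis=1)[None, :]
        a_lC = a0 + np.einsum('ud,udk->', R, xi_M)
        b_lC = b0 + (E(a_eta, b_eta).sum(0) * E(a_theta, b_theta).sum(0)).sum()
        a_lS = a0 + np.einsum('ud,udi->', R, xi_S)
        b_lS = b0 + (A * E(a_tau, b_tau) * R.sum(axis=1)[None, :]).sum()

    return dict(E_eta=E(a_eta, b_eta), E_theta=E(a_theta, b_theta), E_eps=E(a_eps, b_eps),
                E_tau=E(a_tau, b_tau), E_lC=E(a_lC, b_lC), E_lS=E(a_lS, b_lS))
```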

Computing the ELBO: The variational updates calculated in the previous sections are guaranteed not to decrease the ELBO. However, we still need to calculate this lower bound after each iteration to evaluate a stopping condition for the optimization algorithm. We briefly describe a particular lower bound for the ELBO terms involving the log-sum present in the Poisson rate.

Note also that the surrogate distribution factorizes under the mean field assumption (Eq. 7), so the ELBO is a sum of terms corresponding to expected log-probabilities under the surrogate distribution. The terms comprising the log-probabilities of the Poisson likelihood contain an expected value of the logarithm of a sum of latent variables (for example \(\text {E}_q [\log (\sum _k \beta _{vk}\theta _{dk})]\)). This is a challenging computation, but we can apply another lower boundFootnote 5 and simplify it to Eq. 8.

$$\begin{aligned} \begin{array}{rl} \text {E}_q [\log (\sum _k \beta _{vk}\theta _{dk})] &{}\ge \sum _k \phi _{dv,k}^*\left( \text {E}_q[\log \beta _{vk}]+\text {E}_q[ \log \theta _{dk}] \right) \\ &{}- \sum _k \phi _{dv,k}^* \log \phi _{dv,k}^* \end{array} \end{aligned}$$
(8)

The same simplification can be applied to all Poisson terms independently because of the mean field assumption, and it is equivalent to using the auxiliary latent counts. For example, using the latent variable \(Y_{dv,k}\) together with \(\beta _{vk}\) and \(\theta _{dk}\), the corresponding Poisson term in the ELBO results in Eq. 9.

$$\begin{aligned} \begin{array}{rl} \text {E}_q \left[ \log \frac{p(Y_{dv})}{q(Y_{dv})} \right] &{}= \sum _k W_{dv}\phi _{dv,k}^* \text {E}_q[\log (\beta _{vk}\theta _{dk})]\\ &{}-\text {E}_q[\beta _{vk} \theta _{dk}]-W_{dv}\phi _{dv,k}^*\log (\phi _{dv,k}^*)-\log (W_{dv}!) \end{array} \end{aligned}$$
(9)

For the Gamma terms, the calculations are a direct application of the ELBO formula to the appropriate variable. For example, Eq. 10 describes the resulting terms for \(\beta _{vk}\).

$$\begin{aligned} \begin{array}{rl} \text {E}_q \left[ \log \frac{p(\beta _{vk})}{q(\beta _{vk})}\right] &{}= \log \frac{\Gamma (a_{\beta _{vk} })}{\Gamma (a^0_{\beta })}+a^0_{\beta } \log b^0_{\beta }+a_{\beta _{vk}}(1-\log b_{\beta _{vk}}) \\ &{}+(a^0_{\beta }-a_{\beta _{vk}})\text {E}_q[\log \beta _{vk}]-b^0_{\beta }\text {E}_q[\beta _{vk}] \end{array} \end{aligned}$$
(10)

Recommendations: Once we have learned the latent factors of the model from the observations, we can infer the user preference over the set of items using the expected value of the user–item rating \(\text {E}[ R_{ud} ]\). The recommendation algorithm ranks the unobserved items for each user according to \(\text {E}[ R_{ud} ]\) and recommends the top-M items. We use the variational distribution to efficiently compute \(\text {E}[ R_{ud} ]\) as defined in Eq. 11. This value can be broken down into three non-negative scores: \(\text {E}_q[ \eta _{u}]^T\text {E}_q[\epsilon _{d}]\), representing the “classic” collaborative filtering matching of user preferences and item features; \(\text {E}_q [\lambda _C]\text {E}_q[ \eta _{u}]^T\text {E}_q[\theta _{d}]\), representing the content factor contribution; and \(\text {E}_q[\lambda _S] \sum _{i\in N(u)} \text {E}_q[\tau _{ui}]R_{id}\), the social influence contribution.

$$\begin{aligned} \text {E}[ R_{ud} ]&\approx \text {E}_q[ \eta _{u}]^T(\text {E}_q [\lambda _C]\text {E}_q[\theta _{d}]+\text {E}_q[\epsilon _{d}])+\text {E}_q[\lambda _S] \sum _{i\in N(u)} \text {E}_q[\tau _{ui}]R_{id} \end{aligned}$$
(11)
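For illustration, a possible scoring routine implementing Eq. 11 and the top-M ranking, assuming the posterior expectations returned by a fitted model (the function and argument names are ours):

```python
import numpy as np

def recommend_top_m(u, E_eta, E_theta, E_eps, E_tau, E_lC, E_lS, R, A, M=10):
    """Rank unseen items for user u by E[R_ud] (Eq. 11); names are illustrative."""
    cf_score      = E_eta[u] @ E_eps.T                     # eta_u^T eps_d
    content_score = E_lC * (E_eta[u] @ E_theta.T)          # lambda_C * eta_u^T theta_d
    social_score  = E_lS * ((A[u] * E_tau[u]) @ R)         # lambda_S * sum_i tau_ui R_id
    scores = cf_score + content_score + social_score       # one score per item d
    scores[R[u] > 0] = -np.inf                             # do not re-recommend seen items
    return np.argsort(-scores)[:M]
```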

Complexity and Convergence: The complexity of each iteration of the variational inference algorithm is linear in the number of latent factors K, non-zero ratings nR, non-zero word–document counts nW, users U, items D, vocabulary size V, and neighbors per user nS; in other words, \(O(K(nW+nR+nS+U+D+V))\). We have shown that we can obtain closed-form updates for the inference algorithm, which stems from the fact that the model is fully conjugate and in the exponential family of distributions. In this setting variational inference is guaranteed to converge, and in our experiments the algorithm converged after 20 to 40 iterations.

5 Evaluation

In this section, we analyze the predictive power of the proposed model on a real-world dataset and compare it with state-of-the-art methods.Footnote 6

Datasets: To be able to compare with the state-of-the-art method Correlated Topic Regression with Social Matrix Factorization (CTR-SMF) [5], we conducted experiments on the hetrec2011-lastfm-2k (Last.fm) dataset [17]. This dataset consists of a set of weighted user–artist interactions (artists form the item set), a set of user–artist–tag assignments, and a set of user–user relationsFootnote 7. We process the dataset to create an artist–tag matrix by summing up all the tags given by all users to a given artist; this matrix is the item–content matrix in our model. We also discard the user–artist weights, using a count of “1” for all observed interactions. After this preprocessing, we sample 85% of the user–artist observations for training and keep 15% held out for predictive evaluation, selecting only users with more than 5 item ratings for the training part of the split.

Metric: Given the random splits into training and test sets, we train our model and use the estimated latent factors to predict the entries in the test set. In this setting zero ratings cannot necessarily be interpreted as negative, which makes the precision metric problematic. Instead, we focus on the recall metric, both to be comparable with previous work [5] and because the set of relevant items is known. Specifically, we calculate the recall at the top M items (recall@M) for each user, defined as:

$$\begin{aligned} \text {recall@}M =\frac{\text {number of items the user likes in the top }M}{\text {total number of items the user likes}} \end{aligned}$$
(12)

Recall@M from Eq. 12 is calculated per user; to obtain a single measure for the whole dataset we average it over all users, obtaining the Avg. Recall@M.
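For concreteness, a small sketch of the evaluation metric (Eq. 12) averaged over users; the function and variable names are ours:

```python
import numpy as np

def avg_recall_at_m(ranked_items, held_out, M=20):
    """ranked_items: dict user -> list of item ids ordered by predicted score;
    held_out: dict user -> set of relevant (held-out) item ids."""
    recalls = []
    for u, relevant in held_out.items():
        if not relevant:
            continue
        hits = len(set(ranked_items[u][:M]) & relevant)
        recalls.append(hits / len(relevant))
    return float(np.mean(recalls))
```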

5.1 Experiments

Initially we set all the Gamma hyperparameters to the same values, \(a_{all}\) Footnote 8 and \(b_{all}\) Footnote 9 equal to 0.1, while varying the latent dimensionality K. For each value of K we ran the experiments on 30 random splits of the dataset in order to generate boxplots of the final recommendation recall. We compare our results with the results reported in [5] for the same dataset with optimal parameters. In this first experiment we let the algorithm estimate the optimal content weight \(\lambda _C\) and social weight \(\lambda _S\). Figure 2 shows that PoissonMF-CS consistently outperforms CTR-SMF and CTR by a large margin (Fig. 2a), and outperforms the other Poisson factorization methods (Fig. 2b) by a significant margin (\(p \le 1\cdot 10^{-6}\) in a Wilcoxon paired test for each M). This may indicate that both the choice of a Poisson likelihood with non-negative latent factors and the modelling of content and social weights have a positive impact on the predictive power of the model.

Fig. 2. Comparison of PoissonMF-CS with alternative models. Each subplot is the result of running the PoissonMF-CS recommendation algorithm over 30 random splits of the Hetrec2011-lastfm dataset for a fixed number of latent features K (in this case, \(K=\{10,15\}\)). The values for CTR-SMF, CTR, and PMF were taken from [5] and, according to the reported results, are the best values found after a grid search.

Fig. 3. Impact of the number of latent factors (K) on the Avg. Recall@M metric for different numbers of returned items (M). Each subplot is the result of running the PoissonMF-CS recommendation algorithm over 30 random splits of the dataset with K varying in (5, 10, 15, 20, 50, 100)

Fig. 4. Evaluation of the impact of the content and social weight parameters (in all experiments in this figure \(K=10\))

Fig. 5. Evaluation of the impact of the latent Gamma hyperpriors on the recall (in all experiments in this figure \(K=10\))

Model selection. Figure 3 shows the predictive performance of PoissonMF-CS for different numbers of latent factors K on the Hetrec2011-lastfm dataset. We conclude that the optimal choice is \(K=15\). This result is important, indicating that the model generates compact latent representations, given that the optimal choice of K reported for CTR-SMF on the same dataset is 200. In Fig. 5 we show the results for the latent variable hyperparameters. We ran one experiment varying the hyperparameters \(a_{all}\) and \(b_{all}\) to understand their impact on the final recommendation. We noticed that the optimal values for both hyperparameters, across different values of M, are between 0.1 and 1, a result consistent with the recommendations in the literature [4, 6, 12] and with the statistical intuition that a Poisson likelihood with a Gamma prior with shape parameter \(a<1\) favours sparse latent representations.

The next experiment was to fix the content and social weights and evaluate the impact of these weights on the result. In Fig. 4 we can see that the pattern across different values of M is not evident, but it indicates that the resulting recall is less sensitive to changes in the content and social weight parameters than to the hyperparameters \(a_{\text {all}}\) and \(b_{\text {all}}\). This also suggests that the importance of the social and content factors is not the same at different points of the ranked list of recommendations.

6 Conclusion

This article describes PoissonMF-CS, a joint Bayesian model for recommendation integrating three sources of information: item textual content, the user social network, and user–item interactions. It generalizes existing Poisson factorization models for recommendation by adding both content and social features. Our experiments show that the proposed model consistently outperforms previous Poisson models (SPF and CTPF) and alternative joint models based on Gaussian probabilistic factorization and LDA (CTR-SMF and CTR) on a dataset containing both content and social side information. These results demonstrate that jointly modeling social and content features with Poisson models improves recommendations, admits scalable inference, and generates more compact latent features. Although the batch variational inference algorithm is already efficientFootnote 10, one future improvement will be the design of a stochastic variational inference algorithm for very large-scale inference.