1 Introduction

Recommendation systems play an important role in filtering out irrelevant information and suggesting content that interests users. Examples include book recommendation in LibraryThing and movie recommendation in Netflix. Collaborative Filtering (CF) [9] has been successful in predicting user preferences based on like-minded users' interests, using their historical records. In practice, however, users rate only a small fraction of the available items. Besides, new users and/or items are added at regular intervals. These two phenomena lead to the sparsity and cold-start problems.

To deal with the data sparsity and cold-start problems, researchers have exploited Cross-Domain Collaborative Filtering (CDCF) [3, 7, 8, 15], which leverages knowledge extracted from related domains. For example, users who watch movies from the adventure genre will most likely be interested in reading books on adventure travel. In most cases, however, users and/or items across domains do not overlap. In such a scenario, exploiting user-generated tags [2, 4, 5, 11, 13, 14, 15, 16] (e.g. tags like ancient-literature or military-history) to bridge the related domains is becoming a popular way of enhancing personalized recommendations. That is, though we do not know the exact mapping of users and/or items across domains, we establish the connections under the following assumption: users who use similar tags are similar, and items that are assigned similar tags are similar. Nevertheless, existing tag-based CDCF models rely on common tags and their co-occurrence counts alone to bridge the related domains [5, 13, 14]. Hence, these models capture only the syntactic similarities between tags and ignore the semantic relationships between them. We illustrate this with the following toy example.

Fig. 1. Illustration of tag-based CDCF.

Let us assume that a cross-domain system has users (Bill and Mark), items (The Prestige and Inception from the movie domain, and Inferno and Bk from the book domain), ratings (on a scale of 1–5, where 1 is low and 5 is high) and tag assignments as shown in Fig. 1.

Treating tags as tokens/lexicons, most of the existing models connect the users Bill and Mark only through the tags biography and war, based on their direct usage. These models completely ignore the relationship between the tags science-fiction and fiction (as well as ww2 and hitler). However, these tags provide a lot of knowledge about the similarity between the movies The Prestige and Inception, and also between the movie The Prestige and the book Inferno, since most users who are interested in movies or books related to fiction might also be interested in movies or books related to science-fiction. This can be captured well by word-vector embeddings such as word2vec [12]: the cosine similarity between the embedding vectors of fiction and science-fiction is 0.61, which is considerably high. Beyond the item similarities that can be established through tags, the similarity between the users Mark and Bill can also be strongly established with the help of semantic word embeddings, because the users are interested in the similar topics fiction and science-fiction based on the items they rated. Accounting for all of the above lets us predict the missing rating of the user Bill on the movie Inception more accurately.
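Such a semantic check can be done directly with pre-trained word2vec vectors; below is a minimal sketch using the gensim library (the vector file and the token form of multi-word tags are assumptions, and the exact similarity value depends on the pre-trained vectors used):

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained word2vec vectors; the file path is illustrative.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def tag_similarity(tag_a: str, tag_b: str) -> float:
    """Cosine similarity between the embedding vectors of two tags."""
    va, vb = wv[tag_a], wv[tag_b]
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Multi-word tags (e.g. "science-fiction") may need to be mapped to the
# vocabulary's token form (e.g. "science_fiction") before lookup.
print(tag_similarity("fiction", "science_fiction"))
```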

Contributions. We propose TagEmbedSVD, a novel extension of SVD++ to the cross-domain setting that makes no assumption that users or items overlap across domains. TagEmbedSVD leverages user-generated tags to bridge the related domains: it employs word2vec – a word-vector representation – to find the semantic relationships between tags, bridging the domains and enhancing recommendation performance. We perform comprehensive experiments on two real-world datasets – LibraryThing and MovieLens – and show that our proposed model outperforms existing tag-based CDCF models, particularly in sparse and cold-start settings. Our implementation is available at https://github.com/mvijaikumar/TagEmbedSVD.

2 Proposed Model

Problem Formulation. Suppose we have sets of ratings \([r_{uj}^S]_{m_S \times n_S}\) and \([r_{uj}^T]_{m_T \times n_T}\) from the source and target domains. Here \(r_{uj}^S, r_{uj}^T \in [a,b] \cup \{0\}\), where \(a,b > 0\) denote the minimum and maximum ratings of user u on item j, and 0 denotes an unavailable rating. Further, \(m_S, n_S, m_T\) and \(n_T\) represent the numbers of users and items in the source and target domains, respectively. In addition, we are given the tags associated with every user u and item j, denoted by the sets \(\mathcal {T}_u\) and \(\mathcal {T}_j\) respectively. Here, a tag can be a word, a set of words, or a phrase (for example, philosophy, mind-blowing, one time watchable).

Let \(U^S, U^T, I^S\) and \(I^T\) denote the sets of users and items from the source and target domains respectively. Note that \(U = U^S \cup U^T\) and \(I = I^S \cup I^T\), with \(U^S \cap U^T = \varnothing \) and \(I^S \cap I^T = \varnothing \). Let \(\mathcal {I}_u\) and \(\mathcal {U}_j\) be the set of items rated by user u and the set of users who rated item j, respectively. Let \(\varOmega ^S = \lbrace (u,j) : r_{uj}^S > 0 \rbrace \) and \(\varOmega ^T = \lbrace (u,j) : r_{uj}^T > 0 \rbrace \) be the sets indicating the available ratings, and \(\varOmega = \varOmega ^S \cup \varOmega ^T\). Our goal is to predict the unavailable ratings for users on items in the target domain with the help of the available ratings and tag information from both domains. Formally, we want to predict the ratings \(r_{uj}^T, ~\forall (u,j) \not \in \varOmega ^T\) in the target domain using \(r_{uj}^S, ~\forall (u,j) \in \varOmega ^S\), \(r_{uj}^T, ~\forall (u,j) \in \varOmega ^T\) and \(\mathcal {T}_u, \mathcal {T}_j, ~\forall u \in U,~\forall j \in I \). To avoid notational clutter, wherever the context is clear we drop the superscripts S and T from \(\varOmega ^S\) and \(\varOmega ^T\) and combine the ratings from the source and target domains.
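The setup above maps onto a few simple data structures; a minimal sketch with illustrative names:

```python
from collections import defaultdict

# Ratings keyed by (user, item); absence of a key means r_uj = 0 (unavailable).
ratings_S = {}  # {(u, j): r_uj in [a, b]} for the source domain
ratings_T = {}  # {(u, j): r_uj in [a, b]} for the target domain

# Tag assignments: T_u and T_j as per-user and per-item tag multisets.
tags_of_user = defaultdict(list)  # e.g. tags_of_user[u] = ["fiction", "ww2"]
tags_of_item = defaultdict(list)

# Index sets of available ratings: Omega^S, Omega^T and their union Omega.
Omega_S, Omega_T = set(ratings_S), set(ratings_T)
Omega = Omega_S | Omega_T  # user/item sets across domains are disjoint
```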

2.1 TagEmbedSVD

In this section, we explain our proposed model, TagEmbedSVD, in detail. TagEmbedSVD is an extension of SVD++ [9]. The main objective is to incorporate knowledge learned from tags into the SVD++ model for the cross-domain setting. In this way, tags not only provide additional knowledge about the system but also bridge users and items across domains. Let \(p_u \in \mathbb {R}^d\) be the user (u) embedding, \(q_j \in \mathbb {R}^d\) and \(y_j \in \mathbb {R}^d\) be item (j) embeddings, and \(\mu \in \mathbb {R}, b_u \in \mathbb {R}\) and \(b_j \in \mathbb {R}\) be the mean of all available ratings, the user bias and the item bias, respectively. Let \(t_k \in \mathbb {R}^c\) be the embedding vector associated with tag k, where c denotes the embedding dimension. We predict the rating \(\hat{r}_{uj}\) as follows:

$$\begin{aligned} \hat{r}_{uj} = \mu + b_u + b_j + (p_u + |\mathcal {I}_u|^{-\frac{1}{2}} \!\sum _{i \in \mathcal {I}_u}\! y_i)^{\prime } q_j + \frac{\alpha }{|\mathcal {T}_u|} \sum _{k \in \mathcal {T}_u} w_u ^{\prime } E t_k + \frac{\beta }{|\mathcal {T}_j|} \sum _{k \in \mathcal {T}_j} x_j ^{\prime } F t_k, \end{aligned}$$
(1)

where \(w_u \in \mathbb {R}^d\) captures user u's preferences towards the tags and \(x_j \in \mathbb {R}^d\) captures item j's characteristics towards the tags. We obtain the tag embeddings from word2vec [12]. Thus, any two tags k and \(k'\) can be compared via their embeddings \(t_k\) and \(t_{k'}\) from the start of training. As a consequence, if two users share similar tag preferences, we can relate them irrespective of their domains. That is, \(p_u\) and \(q_j\) are learned only from the available ratings in the corresponding domains, since users (items) from different domains do not share any item (user) in common; however, \(w_u\) and \(x_j\) are learned jointly across domains through the tag embeddings. Here, \(\alpha \) and \(\beta \) control the influence of tags on the predictions. In our model, to share a common embedding space and to have flexibility in choosing the dimension of \(t_k\), we use projection matrices \(E, F \in \mathbb {R}^{d \times c}\).

The number of occurrences of the tags associated with users and items plays an important role in characterising them. To incorporate this knowledge, we redefine \(|\mathcal {T}_u|\) as \(\sum _{k \in \mathcal {T}_u} \eta _{uk}\), where \(\eta _{uk}\) denotes the frequency with which tag k is associated with user u. Similarly, we define \(|\mathcal {T}_j|\) to be \(\sum _{k \in \mathcal {T}_j} \eta _{jk}\). Therefore, Eq. (1) becomes:

$$\begin{aligned} \hat{r}_{uj}&= \mu + b_u + b_j + (p_u + |\mathcal {I}_u|^{-\frac{1}{2}} \sum _{i \in \mathcal {I}_u} y_i)^{\prime } q_j +\nonumber \\&\frac{\alpha }{|\mathcal {T}_u|} \sum _{k \in \mathcal {T}_u} \eta _{uk} w_u ^{\prime } E t_k + \frac{\beta }{|\mathcal {T}_j|} \sum _{k \in \mathcal {T}_j} \eta _{jk} x_j ^{\prime } F t_k. \end{aligned}$$
(2)
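For concreteness, Eq. (2) can be transcribed directly. Below is a minimal NumPy sketch under the notation above, where P, Q, Y, W, X hold the rows \(p_u, q_j, y_i, w_u, x_j\), t maps a tag to its word2vec vector, and eta_user[u] / eta_item[j] map tags to their frequencies; all names are illustrative and the parameters are assumed to be initialized elsewhere:

```python
import numpy as np

def predict(u, j, mu, b_u, b_j, P, Q, Y, W, X, E, F, t,
            items_of_user, eta_user, eta_item, alpha, beta):
    """Predicted rating r_hat_uj of Eq. (2)."""
    I_u = items_of_user[u]  # non-empty: every training user has >= 1 rating
    # SVD++ part: explicit user factor plus normalized implicit item factors.
    implicit = sum(Y[i] for i in I_u) / np.sqrt(len(I_u))
    r_hat = mu + b_u[u] + b_j[j] + (P[u] + implicit) @ Q[j]

    # Tag part: frequency-weighted tag embeddings projected into R^d by E, F.
    norm_u = sum(eta_user[u].values())  # |T_u| redefined as sum_k eta_uk
    norm_j = sum(eta_item[j].values())
    r_hat += (alpha / norm_u) * sum(c * (W[u] @ (E @ t[k]))
                                    for k, c in eta_user[u].items())
    r_hat += (beta / norm_j) * sum(c * (X[j] @ (F @ t[k]))
                                   for k, c in eta_item[j].items())
    return float(r_hat)
```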

Additionally, we use a weighted-regularization technique [6, 9] to control the over-fitting that arises due to sparsity. Here, popular users and items are penalized less, since more ratings are available for them, while users and items with few ratings are penalized more. For instance, we regularize the user representation \(p_u\) by weighting its penalty with the scalar \(|\mathcal {I}_u|^{-\frac{1}{2}}\) instead of \(|\mathcal {I}_u|\). Note that the former penalizes users who rated more items less heavily than the latter; moreover, weighted regularization does not drop any user or item. Let \(\lambda \) and \(\lambda _M\) be positive hyperparameters that control over-fitting. We have the following optimization problem:

$$\begin{aligned} \begin{aligned} \min _{p_*, q_*, y_*, x_*, w_*,b_*} \mathcal {L} = \frac{1}{2} \sum _{(u,j) \in \varOmega ^S \cup \varOmega ^T} (\hat{r}_{uj} - r_{uj})^2 + \frac{\lambda }{2} (\sum _{u}|\mathcal {I}_u|^{-\frac{1}{2}} b_u^2 + \\ \sum _{j}|\mathcal {U}_j|^{-\frac{1}{2}} b_j^2) + \frac{\lambda }{2} \sum _{u}|\mathcal {I}_u|^{-\frac{1}{2}} (||p_u||^2 + ||w_u||^2) +\\ \frac{\lambda }{2} \sum _{j} |\mathcal {U}_j|^{-\frac{1}{2}} (||q_j||^2 + ||x_j||^2) + \frac{\lambda }{2} \sum _{i}|\mathcal {U}_i|^{-\frac{1}{2}}||y_i||^2 + \frac{\lambda _M}{2} (||E||_F^2 + ||F||_F^2). \end{aligned} \end{aligned}$$
(3)
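A sketch of the corresponding objective with the weighted-regularization terms written out, continuing the notation of the prediction sketch above (r_hat is assumed to hold the predictions from Eq. (2)):

```python
import numpy as np

def objective(Omega, r, r_hat, items_of_user, users_of_item,
              b_u, b_j, P, W, Q, X, Y, lam, lam_M, E, F):
    """Objective of Eq. (3): squared error plus weighted L2 regularization."""
    loss = 0.5 * sum((r_hat[(u, j)] - r[(u, j)]) ** 2 for (u, j) in Omega)
    # Weighted regularization: the penalty weight |I_u|^{-1/2} (resp.
    # |U_j|^{-1/2}) shrinks for users (items) with many ratings.
    for u in items_of_user:
        wgt = len(items_of_user[u]) ** -0.5
        loss += 0.5 * lam * wgt * (b_u[u] ** 2 + P[u] @ P[u] + W[u] @ W[u])
    for i in users_of_item:
        wgt = len(users_of_item[i]) ** -0.5
        loss += 0.5 * lam * wgt * (b_j[i] ** 2 + Q[i] @ Q[i]
                                   + X[i] @ X[i] + Y[i] @ Y[i])
    return loss + 0.5 * lam_M * (np.sum(E * E) + np.sum(F * F))
```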

Complexity Analysis. Let \(a_I\) and \(a_T\) be the average number of items rated by a user and the average number of tags used by a user (or assigned to an item), respectively. A naive implementation of TagEmbedSVD takes \(O(a_I |\varOmega |d + a_T |\varOmega |dc)\) time to compute the objective value in Eq. (3), since we project tags into the lower dimension using E and F. However, we can use a hash table to store and retrieve the sum of the projected tag embeddings for each user and item, which reduces the complexity to \(O(a_I|\varOmega |d + |\mathcal {T}^{uniq}|dc)\), where \(|\mathcal {T}^{uniq}|\) denotes the number of unique tags in the system. Note that, in practice, \(|\mathcal {T}^{uniq}|dc \ll a_I|\varOmega |d\). In addition, computing the gradients of TagEmbedSVD requires \(O(a_I |\varOmega | d + d |\mathcal {T}^{uniq}| c + a_T |\varOmega | d c)\), whereas SVD++ requires \(O(a_I |\varOmega | d)\). This is due to the bottleneck of computing \(\frac{\partial {\mathcal {L}}}{\partial E}\) and \(\frac{\partial {\mathcal {L}}}{\partial F}\). It can be reduced considerably by updating E and F at fixed intervals instead of at every iteration. In our experiments, we observed that updating E and F only every ten iterations does not degrade the performance significantly.
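The hash-table trick amounts to projecting each unique tag once and caching, per user and per item, the frequency-weighted sum of the projected embeddings; a sketch, continuing the notation above (the caches are refreshed only when E and F are updated, e.g. every ten iterations as noted):

```python
def cache_tag_sums(eta, t, M):
    """Cache sum_k eta_k * (M @ t[k]) for every user (or item).

    Each unique tag is projected once, costing O(|T_uniq| d c); the
    per-rating tag term then reduces to a single d-dimensional lookup.
    """
    projected = {k: M @ t[k] for k in t}  # hash table of projected tags
    return {u: sum(c * projected[k] for k, c in counts.items())
            for u, counts in eta.items()}

user_tag_sum = cache_tag_sums(eta_user, t, E)  # recomputed only when E changes
item_tag_sum = cache_tag_sums(eta_item, t, F)  # likewise for F
```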

Table 1. Dataset statistics
Table 2. Performance comparison for setting 1.
Table 3. Performance comparison for setting 2.

3 Experiments

Datasets and Evaluation Methodology. We used two publicly available datasets – MovieLens-10M and LibraryThing – for the cross-domain collaborative filtering setting. Statistics of the datasets are given in Table 1. For constructing pairs of training and validation splits, we follow the same procedure as [13, 14]. From the target domain, we extract \(K\%\) of the available ratings for the training set, and the remaining \((100-K)\%\) of the ratings are used for validation or testing. We extract six such pairs. The first pair is used for tuning the hyperparameters; that is, its left-out \((100-K)\%\) of ratings acts as a validation set. Once the hyperparameter values are obtained, we train on the other five pairs with these values and obtain the test-set performance; in these five pairs, the left-out \((100-K)\%\) of ratings act as test sets. We report the average test error over these five test sets as the final performance. During these extractions, we make sure that every user and item has at least one rating in the training set.
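A sketch of this split construction (make_split is a hypothetical helper; the exact procedure is defined in [13, 14]):

```python
import random

def make_split(target_ratings, K, seed):
    """Hold out (100-K)% of target-domain ratings, keeping at least one
    training rating for every user and every item."""
    rng = random.Random(seed)
    pairs = list(target_ratings)
    rng.shuffle(pairs)
    n_train = (len(pairs) * K) // 100
    train, test = set(pairs[:n_train]), []
    users = {u for u, _ in train}
    items = {j for _, j in train}
    for u, j in pairs[n_train:]:
        # Move a held-out rating back into training if its user or item
        # would otherwise have no training rating at all.
        if u not in users or j not in items:
            train.add((u, j)); users.add(u); items.add(j)
        else:
            test.append((u, j))
    return train, test

# Six pairs: the first for hyperparameter tuning, the other five for testing.
splits = [make_split(ratings_T, K=80, seed=s) for s in range(6)]
```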

We conduct experiments under the following settings; in all of them, we add the source-domain ratings (if multiple source domains are available, we combine their ratings) to the training set.

  1. Setting 1: We set \(K=80\).

  2. Setting 1, cold-start users (items): This is the same as setting 1, but only the results corresponding to cold-start users (items) are reported.

  3. Setting 2: Here, we set \(K=40\) to introduce more sparsity in the target-domain part of the training set.

  4. Setting 2, cold-start users (items): This is the same as setting 2, but only the results corresponding to cold-start users (items) are reported.

Further, all in Tables 2 and 3 indicates that all users and items are included in the test set, whereas cold-start users (cold-start items) indicates that only users who rated fewer than five items (items rated by fewer than five users) are included. Similar definitions are used in [6]. In Tables 2 and 3, bold-faced values indicate the best performance, and 'Improvement' indicates the relative improvement that TagEmbedSVD achieves over the best-performing comparison model, which is highlighted with the symbol *.
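The cold-start subsets can be selected as follows (a sketch; the rating counts are assumed to be taken over the training set):

```python
def cold_start_users(test_pairs, items_of_user, threshold=5):
    """Test ratings whose user rated fewer than `threshold` items."""
    return [(u, j) for u, j in test_pairs if len(items_of_user[u]) < threshold]

def cold_start_items(test_pairs, users_of_item, threshold=5):
    """Test ratings whose item was rated by fewer than `threshold` users."""
    return [(u, j) for u, j in test_pairs if len(users_of_item[j]) < threshold]
```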

Comparison of Models. We compare our model with the following tag-based CDCF models.

  1. TagCDCF [13] extends matrix factorization and leverages tag information to improve performance, estimating the similarities between users and items with the help of common tags.

  2. GTagCDCF [14] connects the source and target domains by common tags; it additionally takes the frequency of tag usage into account.

  3. TagGSVD++ [5] is an extension of the SVD++ model; it uses tag information in place of the implicit feedback in SVD++ to obtain user and item representations.

Since it has been demonstrated in [13, 14] that tag-based CDCF models perform better than single-domain models, we do not include the latter in the comparison.

Metrics. We employ two well-known metrics, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) for performance comparisons [6, 9, 10].
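Both metrics are standard; for reference:

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """Mean Absolute Error and Root Mean Square Error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))
```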

Fig. 2. Impact of parameters (a) \(\alpha \) and (b) \(\beta \) on cold-start users and items, respectively; the y-axis indicates the MAE value.

Parameter Setting. We tune the hyperparameters using random hyperparameter search [1] on the validation set, with 100 trials for each model. For TagEmbedSVD, we tune \(\lambda \) in [0.01, 2], \(\alpha \) in [0.00001, 0.05], \(\beta \) in [0.00001, 0.05] and the latent dimension (d) in {5, 10, 20, 30, 40}. The ranges for the comparison models were taken from the respective papers [5, 13, 14].
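A sketch of the random search over the stated ranges (uniform sampling is an assumption; [1] also discusses log-uniform sampling, which may suit the wide \(\alpha \) and \(\beta \) ranges better):

```python
import random

def sample_config(rng):
    """One random draw from the search ranges stated above."""
    return {
        "lambda": rng.uniform(0.01, 2.0),
        "alpha":  rng.uniform(1e-5, 0.05),
        "beta":   rng.uniform(1e-5, 0.05),
        "d":      rng.choice([5, 10, 20, 30, 40]),
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(100)]  # 100 trials per model
# The configuration with the best validation error is then used for testing.
```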

3.1 Results

Performance: Tables 2 and 3 compare the performance of TagEmbedSVD with the other tag-based CDCF models. We conducted a paired t-test, and all the improvements are statistically significant at \(p<0.01\). The main findings from Tables 2 and 3 are summarized as follows:

  1. TagEmbedSVD outperforms the other models by up to 5.84% (setting 1) and up to 4.60% (setting 2) when all users and items are used. Similarly, it gives improvements of up to 7.24% (setting 1) and 5.49% (setting 2) for cold-start users, and 5.54% (setting 1) and 5.04% (setting 2) for cold-start items. This demonstrates the significance of using distributed representations of tags to improve performance within and across domains.

  2. One of the main reasons for TagEmbedSVD's better performance compared to TagCDCF and GTagCDCF is its utilization of all available tags instead of only the common ones. Further, although both are extensions of SVD++, TagEmbedSVD improves over TagGSVD++. The reason is that TagGSVD++ treats tags as tokens, so fiction and science-fiction (or hitler and ww2) are two different tags, whereas TagEmbedSVD utilizes pre-trained distributed representations of the tags, under which such tags lie very close to each other in the embedding space.

Impact of Parameters \(\alpha \) and \(\beta \): We investigate the effect of the parameters \(\alpha \) and \(\beta \) in Eq. (2), which control the influence of tag information in TagEmbedSVD. Note that setting both \(\alpha \) and \(\beta \) to zero recovers the SVD++ model. For cold-start users in the ML and LT datasets, we fixed the other hyperparameter values and varied \(\alpha \) in both setting 1 and setting 2. As we increase the value of \(\alpha \), the performance of the model improves, as shown in Fig. 2(a). If \(\alpha \) is set too high, the performance decreases, because the information coming from the tags dominates the rating values. We observe similar behavior for the parameter \(\beta \) in the cold-start item setting, as illustrated in Fig. 2(b). Further, the objective function value with respect to the number of iterations is given in Fig. 2(c).

4 Conclusion

In this paper, we propose TagEmbedSVD, a simple and easy-to-train tag-based cross-domain collaborative filtering model that leverages tag information for cross-domain recommendation. TagEmbedSVD differs from other models by employing distributed representations of tags to bridge the source and target domains; in this way, any two tags can be compared. Our experimental results show that our model performs better than other tag-based models in various sparse and cold-start settings. Although we use a single source and a single target domain, TagEmbedSVD is general, and any number of domains can be used without further modification of the model.