Attention distribution guided information transfer networks for recommendation in practice

https://doi.org/10.1016/j.asoc.2020.106772

Highlights

  • We propose a model ADGITN that conforms to real application scenarios.

  • Our model employs attention distribution to guide the extraction of features.

  • It applies to tasks whose training, validation and testing sets have different data distributions.

Abstract

Recently, an increasing number of deep learning-based methods have been applied to recommendation. Most such methods outperform traditional methods, especially when natural language processing (NLP) techniques are applied to review texts. Many deep learning-based recommender systems learn latent representations of the reviews written by a target user and of the reviews written for a target item; these representations are then combined to predict the rating of the target user for the target item. However, most previously proposed review-based deep learning methods do not conform to real-world application scenarios, in which we cannot obtain the review of the target user for the target item (called the U2I review). In real-world recommendation settings, items are always recommended to users before they have experienced them. Therefore, the review of a target user for a target item is not available during testing and validation. Many methods, such as DeepCoNN and D-ATT, do not exclude the U2I review during validation and testing. Their testing process therefore differs from real-world application scenarios, and these methods obtain substantial valuable information from the U2I review that target users write for target items. We propose a model called ADGITN and a training strategy to solve this problem. During training, an auxiliary model learns, through auxiliary tasks, two attention distributions of the U2I review over the user reviews and the item reviews. These two distributions guide the learning of the main model’s attention distributions between user reviews and item reviews. Thus, the main model learns how to extract attention distributions between user reviews and item reviews according to the valuable information extracted from U2I reviews. During validation, only the main model runs, and it can extract good attention distributions even without the help of a U2I review. Extensive experiments show the effectiveness of our model.
We validate our model on the Amazon and Yelp19 datasets, and the results show that it outperforms strong existing models, with up to 13.8% relative improvement over MPCN, one of the best review-based deep learning models for recommendation.

Introduction

E-commerce platforms are playing an increasingly important role in people’s lives. Many e-commerce platforms collect users’ ratings and review texts for items, along with other types of user feedback, to obtain a better understanding of their users. Such feedback can guide the platform to recommend items that users have not purchased but in which they may be interested. The recommender system is the key component implementing this process. As a branch of data science, recommender systems require substantial amounts of data that can represent user preferences and item attributes, and this feedback provides exactly such data. We can exploit it to predict the rating a user would give an item that they have yet to purchase and then recommend items with high predicted ratings to the user.

Since we want to predict the ratings of users for items, the historical ratings that a user has given other items are helpful, and we can obtain these ratings from the abovementioned feedback. Ratings directly reflect whether users like an item. With ratings, we can employ collaborative filtering (CF) techniques, which are based on the idea that users who give similar ratings to the same items have similar preferences and may purchase similar items in the future. Many CF-based methods use rating information to learn latent user preferences, such as matrix factorization (MF) [1], probabilistic matrix factorization (PMF) [2] and collaborative topic regression (CTR) [3].
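The CF idea above can be sketched with a minimal matrix factorization in numpy: each user and item gets a latent vector, and a rating is approximated by their inner product. The function name `mf_sgd`, the toy data and all hyperparameters are illustrative choices, not values from the cited works.

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=8, lr=0.05, reg=0.05, epochs=500, seed=0):
    """Plain matrix factorization: approximate r_ui by the inner product p_u . q_i,
    trained with stochastic gradient descent and L2 regularization."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                 # prediction error on this rating
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Toy data: users 0 and 1 rate the two items similarly, user 2 is the opposite.
# Users with similar rating patterns end up with similar latent vectors,
# which is the core assumption behind CF.
data = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 1, 1), (2, 0, 1), (2, 1, 5)]
P, Q = mf_sgd(data, n_users=3, n_items=2)
```

After training, `P[u] @ Q[i]` approximates the observed ratings and can be evaluated for unobserved user–item pairs to rank candidate items.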

However, ratings cannot precisely express users’ implicit preferences for items. For example, if both user A1 and user A2 give movie B a rating of 5 (a full score), we can see that users A1 and A2 like B very much; however, we do not know why they like movie B. Perhaps user A1 liked the movie’s cast while user A2 liked the movie’s plot; i.e., they gave the same rating but for different reasons. If we want to obtain more accurate user preferences, reviews are essential. Reviews contain valuable information that reflects not only user preferences but also item attributes. Therefore, many researchers have attempted to use natural language processing (NLP) techniques to extract user preferences and item attributes from review texts. The most successful methods, such as DeepCoNN [4], TransNets [5], D-ATT [6] and MAHR [7], are based on deep learning. They build on the intuition that a review written by a user can represent the user because it contains the user’s preferences, and a review written for an item can represent the item because it also contains the item’s attributes. All reviews written by a user are combined to represent the user, and all reviews written for an item are combined to represent the item. In the abovementioned models, the reviews first pass through an embedding layer, and then other deep learning layers, such as attention layers, convolutional layers and max pooling layers, extract latent features from the reviews. This process outputs two feature blocks, one for the user reviews and the other for the item reviews. Finally, these two feature blocks are passed to a factorization machine (FM) [8] block, or an inner product is computed directly.
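The pipeline described above (embedding lookup, convolution, max pooling, then an interaction between the two feature blocks) can be sketched as follows. Shapes, names and the inner-product interaction are illustrative simplifications, not the exact architecture of any cited model, and the random tables stand in for learned parameters.

```python
import numpy as np

def text_tower(token_ids, emb, filters):
    """One review tower: embedding lookup -> 1-D convolution with ReLU
    (valid padding) -> max pooling over time, yielding a fixed-size vector.
    emb: (vocab, d) embedding table; filters: (width, d, n_filters)."""
    x = emb[token_ids]                                   # (seq_len, d)
    w = filters.shape[0]                                 # convolution width
    feats = np.stack([
        np.maximum(np.tensordot(x[t:t + w], filters, axes=([0, 1], [0, 1])), 0)
        for t in range(len(token_ids) - w + 1)
    ])                                                   # (positions, n_filters)
    return feats.max(axis=0)                             # max pooling over time

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))        # toy vocabulary of 100 tokens, d=16
filters = rng.normal(size=(3, 16, 8))   # 8 convolution filters of width 3

user_feat = text_tower(rng.integers(0, 100, size=20), emb, filters)  # user reviews
item_feat = text_tower(rng.integers(0, 100, size=20), emb, filters)  # item reviews
rating_pred = float(user_feat @ item_feat)  # simplest interaction: inner product
```

In the actual models, the inner product at the end may be replaced by an FM block that also captures pairwise feature interactions.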

MPCN [9] is a state-of-the-art deep learning-based model. The authors of MPCN noted that not all reviews are equally important, and naively integrating all reviews to represent users or items introduces noise. For example, assume we want to predict the rating that a user will give a comedy movie. The historical reviews of the user may cover other comedy movies, horror movies and documentaries. Obviously, the reviews of other comedy movies are more important than the rest. Therefore, one block of MPCN chooses a small number (1–5) of the most important reviews and uses only them for prediction. However, in our view, a deep learning model can find the important information by itself: if a review contains valuable information for a prediction, the model will pay greater attention to it and little attention to the other reviews, so it will not be affected by them. Therefore, we need not worry that reviews of low importance will affect the model, as the model will neglect them. Moreover, if the model chooses only a small number of reviews, it loses some information; although this information is less important, it can still reflect user preferences and item attributes. Table 1 shows the main distinguishing attributes of these existing models and our model.
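The contrast between soft attention over all reviews and hard, pointer-style selection of a single review can be illustrated with a toy numpy example. The scoring function here (a dot product against a query vector) and all dimensions are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
reviews = rng.normal(size=(6, 4))     # 6 encoded reviews, 4-dim each
query = rng.normal(size=4)            # e.g. a target-item representation

scores = reviews @ query              # relevance of each review to the query
weights = softmax(scores)             # soft attention: every review keeps some weight
soft_repr = weights @ reviews         # weighted sum retains all information
hard_repr = reviews[scores.argmax()]  # hard selection: one review, the rest discarded
```

With soft attention, low-scoring reviews receive near-zero weight, so noise is down-weighted without being discarded outright; hard selection drops their information entirely.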

Most models share a common problem: their validation process does not conform to real application scenarios, in which we do not have the U2I review that the target user gives the target item. As mentioned above, an item is always recommended to users before they experience it. Therefore, the review of a target user for a target item is not available during testing and validation. If the validation and test datasets contain a U2I review, this is effectively a data leak. When we want to predict the rating of a user for an item, the inputs of most models are subsets of the reviews of the user and the item, following the intuition that the historical reviews of a user can represent the user and the historical reviews of an item can represent the item. These historical reviews may contain the U2I review that the target user wrote for the target item. In other words, the U2I review is the review that corresponds most closely to the rating to be predicted. If the inputs of a model contain the U2I review, the model easily discovers its importance and gives it a large weight, and the task degenerates into a typical sentiment analysis task. Since most of the authors evidently did not remove the U2I review, their models’ performance will likely degrade in real application scenarios. The authors of [5] also identified this problem; thus, they created a model, called TransNets, with a student–teacher architecture [10], [11], [12] that guides the main network to learn the valuable information extracted from the U2I review by a submodel used only during training. However, the signal used to guide the main network is too weak to generalize to the validation process and real applications, since each user–item pair in the training set corresponds to one signal, and the same user–item pairs do not appear in the validation set, testing set or real application data.
When the model is given user–item pairs that it has never seen, it does not know how to extract useful features. Therefore, we propose the ADGITN model, which learns two teacher signals, one for the user and one for the item. These signals are attention distributions learned from the U2I review by a submodel through auxiliary tasks, and they guide the learning of the attention distributions of the main model. Since we have separate signals for the user and the item, our model performs well as long as it has previously seen both the user and the item, even if it has never seen that user–item pair. This condition is easy to guarantee in real application scenarios as long as we update the model every few days. Note that we do not consider the cold-start problem.
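The guidance idea can be sketched as a teacher attention distribution computed with the U2I review supervising a student distribution computed without it. This is a minimal numpy illustration of attention-distribution transfer, not the paper's exact equations: all encoders are replaced by random vectors, and the dot-product scoring and MSE auxiliary loss are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
user_reviews = rng.normal(size=(5, 8))  # encoded historical reviews of the user
u2i = rng.normal(size=8)                # encoded U2I review (training only)
item_repr = rng.normal(size=8)          # encoded target item (always available)

# Teacher signal: how strongly the U2I review attends to each user review.
teacher_attn = softmax(user_reviews @ u2i)
# Student (main model): attention computed without access to the U2I review.
student_attn = softmax(user_reviews @ item_repr)

# Auxiliary loss pulls the student distribution toward the teacher's;
# at validation time only the student path runs, so no U2I review is needed.
aux_loss = float(np.mean((student_attn - teacher_attn) ** 2))
```

A symmetric pair of distributions would be learned on the item side, giving the two per-user and per-item teacher signals described above.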

Contextual attributes usually play an important role in recommendation algorithms. Attributes such as user region, device type, user age and user sex contain additional information that reflects the characteristics of users and items, and model performance would improve if such context-aware features were included. However, for a fair comparison, we remain consistent with prior research and use only review information.

The main contributions of this paper are as follows:

  • We propose a model that conforms to real application scenarios. To the best of our knowledge, our ADGITN is the first model to apply the attention distribution from U2I to guide the extraction of features from user reviews and item reviews. This represents a novel idea for tasks whose training set, validation set and testing set have different data distributions.

  • We conduct extensive experiments on four benchmark datasets, and our model outperforms the deep learning models DeepCoNN, D-ATT, TransNet and MPCN.

  • We exploit the attention distribution of U2I over user reviews and item reviews. We then select the most important reviews of the user and item that have the largest attention distribution weight to compare with the U2I review. This confirms the effectiveness of guidance by the attention distribution.

The remainder of this paper is organized as follows: related works will be discussed in Section 2. Section 3 will introduce our proposed model. The experiments and evaluation will be presented in Section 4. In Section 5, we perform some ablation analysis. We conduct an attention distribution analysis in Section 6, and finally, we conclude our work in Section 7.

Section snippets

Related work

With the rapid development of e-commerce, recommender systems have become a popular research area. We can take advantage of the rich data collected from e-commerce platforms to build many powerful models with machine learning and deep learning techniques. Recently, researchers have found increasingly more useful data, such as reviews that users write for items, sessions of users recorded by backends and other user behavior when browsing e-commerce platforms, that can be thoroughly mined. These

Proposed method

In this section, we present our proposed method, the attention distribution guided information transfer network (ADGITN), whose architecture is illustrated as follows. Fig. 1 shows the architecture of the feature extraction model (FEM). Fig. 2 shows the architecture used for validation and real-world application. Fig. 3 shows the architecture used for training. Specifically, our model has two different architectures for the training and validation processes.

In validation, we should simulate real-world application

Experiment and evaluation

In this section, we introduce the benchmark datasets, our experimental setup and the evaluation. Our experiments address the following research questions (RQs) in Sections 4 and 5:

RQ1: What is the performance of our model compared with existing models such as DeepCoNN, D-ATT, TransNet and MPCN?

RQ2: Is the design of the model effective in extracting valuable information from a U2I to guide the learning of the main model? This question is also mentioned

Ablation analysis

In this section, we discuss the impact of key architectural features. We design six types of ablation experiments on the Yelp dataset, which are introduced as follows.

  • NoU2I: To prove the effectiveness of attention distribution-guided information transfer, which is discussed in Section 3, and answer RQ2, we delete the auxiliary models that correspond to U2I during training. The blocks to be deleted contain two attention layers, the FEM block of U2I, two auxiliary MSE loss functions and two

Attention distribution analysis

In this section, we discuss the contribution of the attention distributions and answer RQ3. The key idea of our work is transmitting, via the attention distributions Au and Ai, the valuable information that the submodel extracts from the U2I review (which is available only during training) to the main model. The two attention layers learn attention distributions for the target user and target item, respectively, and these distributions are used to weight the encoded users and encoded items, respectively.
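A minimal sketch of this weighting step: the distributions Au and Ai pool the encoded reviews into single user and item representations, and the review with the largest weight is the "most important" one selected for comparison with the U2I review. The softmax-over-random-scores weights here are stand-ins for the learned distributions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
enc_user = rng.normal(size=(5, 8))  # 5 encoded user reviews
enc_item = rng.normal(size=(7, 8))  # 7 encoded item reviews

A_u = softmax(rng.normal(size=5))   # attention distribution over user reviews
A_i = softmax(rng.normal(size=7))   # attention distribution over item reviews

user_repr = A_u @ enc_user          # attention-weighted pooling of user reviews
item_repr = A_i @ enc_item          # attention-weighted pooling of item reviews
top_user_review = int(A_u.argmax())  # most important user review for inspection
top_item_review = int(A_i.argmax())  # most important item review for inspection
```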

Conclusion

We proposed a review-based deep learning model for recommendation. Our validation process conforms to real-world application scenarios in which we cannot obtain the U2I review. To better utilize the valuable information of the U2I review, we designed four auxiliary tasks for training to extract information from the U2I review and transferred them to the main model via attention distributions. These attention distributions can be seen as two teacher signals: one for encoded users and one for

CRediT authorship contribution statement

Gang Sun: Conceptualization, Methodology, Software, Validation. Yu Li: Formal analysis, Writing - original draft, Validation. Hongfang Yu: Supervision, Funding acquisition, Project administration. Victor Chang: Writing - review & editing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was partially supported by the PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (PCL2018KP001).

References (35)

  • Y. Koren, et al., Matrix factorization techniques for recommender systems, Computer, 2009.
  • R. Salakhutdinov, et al., Probabilistic matrix factorization, Neural Inf. Process. Syst., 2007.
  • C. Wang, D.M. Blei, Collaborative topic modeling for recommending scientific articles, in: The 17th ACM SIGKDD...
  • L. Zheng, V. Noroozi, P.S. Yu, Joint deep modeling of users and items using reviews for recommendation, in: The Tenth...
  • R. Catherine, et al., Transnets: Learning to transform for recommendation, 2017.
  • S. Seo, J. Huang, H. Yang, et al., Interpretable convolutional neural networks with dual local and global attention for...
  • Z.P. Lin, et al., Joint deep model with multi-level attention and hybrid-prediction for recommendation, Entropy, 2019.
  • S. Rendle, Factorization machines, in: The 10th IEEE International Conference on Data Mining, 2010, pp....
  • Y. Tay, A.T. Luu, S.H. Hui, Multi-pointer co-attention networks for recommendation, in: The 24th ACM SIGKDD...
  • Y. Zhang, T. Xiang, T.M. Hospedales, et al., Deep mutual learning, in: IEEE Conference on Computer Vision and Pattern...
  • C. Buciluă, R. Caruana, A.N. Mizil, Model compression, in: The 12th ACM SIGKDD International Conference on Knowledge...
  • G. Hinton, et al., Distilling the knowledge in a neural network, 2015.
  • Y.F. Feng, et al., Deep session interest network for click-through rate prediction, 2019.
  • B. Hidasi, A. Karatzoglou, Recurrent neural networks with top-k gains for session-based recommendations, in: The 27th...
  • F. Sun, et al., Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer, 2019.
  • J. Devlin, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.
  • A. Radford, et al., Language models are unsupervised multitask learners, OpenAI Blog, 2019.