Attention distribution guided information transfer networks for recommendation in practice
Introduction
E-commerce platforms are playing an increasingly important role in people’s lives. Many platforms collect users’ ratings of items, their review texts and other types of feedback so that they can obtain a better understanding of the users. Such feedback can guide the platform to recommend items that users have not purchased but in which they may be interested, and the recommender system is the component that implements this process. As a branch of data science, recommender systems require substantial amounts of data that represent user preferences and item attributes, and the feedback described above is exactly such data. We can exploit it to predict the rating a user would give an item that he or she has yet to purchase and then recommend the items with the highest predicted ratings.
Since we want to predict ratings by users for items, the historical ratings of the user for other items are helpful, and we can obtain these ratings from the abovementioned feedback. The ratings of users can directly reflect whether users like an item. With ratings, we can employ collaborative filtering (CF) techniques, which are based on the idea that users who give similar ratings to the same items will have similar preferences and that they may purchase similar items in the future. There have been many CF-based methods using rating information to learn user latent preferences such as matrix factorization (MF) [1], probabilistic matrix factorization (PMF) [2] and collaborative topic regression (CTR) [3].
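As a concrete illustration of the CF idea, the sketch below fits a toy MF model by stochastic gradient descent on a handful of observed (user, item, rating) triples. The dimensions, learning rate, regularization strength and toy ratings are invented for illustration; they are not the settings of [1].

```python
import numpy as np

# Toy matrix factorization (MF): predicted rating = P[u] @ Q[i], trained by
# SGD on observed (user, item, rating) triples. All numbers are illustrative.
rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3
P = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item latent factors

observed = [(0, 1, 5.0), (0, 3, 3.0), (1, 1, 4.0), (2, 4, 1.0), (3, 0, 2.0)]

lr, reg = 0.05, 0.01
for _ in range(200):
    for u, i, r in observed:
        err = r - P[u] @ Q[i]                    # prediction error
        P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step for the user
        Q[i] += lr * (err * P[u] - reg * Q[i])   # gradient step for the item

mse = np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in observed])
```

Users who rated the same items similarly end up with nearby latent vectors, which is exactly the "similar ratings imply similar preferences" assumption behind CF.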
However, ratings cannot precisely express users’ implicit preferences for items. For example, if both user A1 and user A2 give movie B a rating of 5 (the full score), we can see that they both like B very much, but we do not know why. Perhaps user A1 liked the movie’s cast while user A2 liked its plot; i.e., they gave the same rating for different reasons. If we want to capture user preferences more accurately, reviews are essential. Reviews contain valuable information that reflects not only user preferences but also item attributes. Therefore, many researchers have attempted to use natural language processing (NLP) techniques to extract user preferences and item attributes from review texts. The most successful methods, such as DeepCoNN [4], TransNets [5], D-ATT [6] and MAHR [7], are based on deep learning. They build on the intuition that the reviews written by a user can represent the user because they contain the user’s preferences, and that the reviews written for an item can represent the item because they contain its attributes. All reviews written by a user are combined to represent the user, and all reviews written for an item are combined to represent the item. In these models, the reviews are first passed through an embedding layer, and then other deep learning layers, such as attention layers, convolutional layers and max pooling layers, extract latent features from them. This process outputs two feature blocks, one for the user reviews and one for the item reviews. Finally, the two feature blocks are fed to a factorization machine (FM) [8] block, or their inner product is taken directly.
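A minimal sketch of this embedding, convolution and max-pooling pipeline follows. The shapes, random weights and toy token ids are placeholders, not the published DeepCoNN architecture; the point is only the flow from review tokens to two feature vectors whose inner product scores a user–item pair.

```python
import numpy as np

# Sketch of the review-based pipeline: embed tokens, apply a 1-D convolution,
# max-pool over time, then score a user-item pair by an inner product.
rng = np.random.default_rng(1)
vocab, d_emb, n_filters, width = 50, 8, 6, 3
E = rng.standard_normal((vocab, d_emb))             # embedding table
W = rng.standard_normal((n_filters, width, d_emb))  # convolution filters

def encode(token_ids):
    x = E[token_ids]                                # (seq_len, d_emb)
    seq = len(token_ids) - width + 1
    conv = np.array([[np.sum(W[f] * x[t:t + width]) for t in range(seq)]
                     for f in range(n_filters)])    # (n_filters, seq)
    conv = np.maximum(conv, 0.0)                    # ReLU
    return conv.max(axis=1)                         # max pooling over time

user_reviews = [3, 7, 12, 9, 21, 4]    # concatenated user-review tokens (toy)
item_reviews = [5, 7, 30, 2, 18, 11]   # concatenated item-review tokens (toy)
score = encode(user_reviews) @ encode(item_reviews)  # inner-product prediction
```

In the cited models the final inner product is often replaced by an FM block, which additionally models pairwise feature interactions.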
MPCN [9] is a state-of-the-art deep learning-based model. Its authors noted that not all reviews are equally important and that naively integrating all reviews to represent users or items introduces noise. For example, assume we want to predict the rating that a user will give a comedy movie. The user’s historical reviews may cover other comedy movies, horror movies and documentaries; obviously, the reviews of other comedy movies are more important than the rest. Therefore, one block of MPCN chooses a small number (1–5) of the most important reviews and uses only those for prediction. However, we argue that a deep learning model can find the important information by itself: if a review contains information valuable for a prediction, the model will pay greater attention to it and little attention to the others, so it will not be affected by them. We therefore need not worry that reviews of low importance will mislead the model, as the model will simply neglect them. Moreover, if the model chooses only a small number of reviews, it loses some information; although this information is less important, it still reflects user preferences and item attributes. Table 1 summarizes the main differences between these existing models and ours.
Most models share a common problem: their validation process does not conform to real application scenarios, in which we do not have the U2I review that the target user gives the target item. As mentioned above, an item is always recommended to users before they experience it. Therefore, the review of a target user for a target item would not be available during testing and validation; if the validation and test datasets contain a U2I review, this is actually a data leak. When we want to predict the rating of a user for an item, the inputs of most models are the reviews of the user and of the item, following the intuition that the historical reviews of a user represent the user and the historical reviews of an item represent the item. These historical reviews may contain the U2I review that the target user wrote for the target item. In other words, U2I is the review that corresponds most closely to the rating to be predicted. If the inputs of a model contain U2I, the model easily discovers its importance and gives it a large weight, and the task is then simplified into a typical sentiment analysis task. Since most authors evidently did not remove the U2I review, their models’ performance will likely degrade in real application scenarios. The authors of [5] also found this problem; thus, they created a model, called TransNets, with a student–teacher architecture [10], [11], [12] in which a submodel used only during training extracts valuable information from U2I and guides the main network to learn it. However, the signal used to guide the main network is too weak to generalize to the validation process and real applications, since each user–item pair in the training set corresponds to one signal, and the same user–item pairs do not appear in the validation set, the testing set or real application data.
When the model is given user–item pairs that it has never seen, it does not know how to extract useful features. Therefore, we propose the ADGITN model, which learns two teacher signals, one for the user and one for the item. These signals are the attention distributions learned from the U2I review by a submodel trained with auxiliary tasks, and they guide the learning of the attention distributions of the main model. Since we have separate signals for the user and the item, our model performs well as long as it has seen the user and the item before, even if it has never seen the user–item pair. This condition is easy to guarantee in real application scenarios as long as we update the model every few days. Note that we do not consider the cold-start problem.
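To make the guidance idea concrete, the following is a hedged sketch of attention-distribution transfer: a submodel that sees the U2I review produces a "teacher" attention distribution over the user's reviews, and an auxiliary MSE loss pulls the main model's attention toward it. The scoring functions, shapes and random vectors here are illustrative assumptions, not the exact layers of ADGITN, which are defined in Section 3.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
n_reviews, d = 5, 4
reviews = rng.standard_normal((n_reviews, d))  # encoded user reviews
u2i = rng.standard_normal(d)                   # encoded U2I review (training only)
query = rng.standard_normal(d)                 # main model's learned query (toy)

teacher_attn = softmax(reviews @ u2i)    # distribution guided by the U2I review
student_attn = softmax(reviews @ query)  # what the main model produces alone

# Auxiliary loss transfers the teacher signal into the main model's attention.
aux_mse = np.mean((student_attn - teacher_attn) ** 2)

# At validation time only the student path exists, and it weights the reviews:
user_repr = student_attn @ reviews
```

Because the transferred signal is a distribution over the user's (or item's) own reviews rather than one signal per user–item pair, it remains usable for pairs never seen during training, as long as the user and item themselves have been seen.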
Contextual attributes usually play an important role in recommendation algorithms. These kinds of attributes, such as user region, device type, user age, and user sex, contain additional information that can reflect the characteristics of users and items. The performance of the model will be better if the context-aware conditions are involved. However, to be fair, we must be consistent with extant research that uses only the review information.
The main contributions of this paper are as follows:
- •
We propose a model that conforms to real application scenarios. To the best of our knowledge, our ADGITN is the first model to apply the attention distribution from U2I to guide the extraction of features from user reviews and item reviews. This represents a novel idea for tasks whose training set, validation set and testing set have different data distributions.
- •
We conduct extensive experiments on four benchmark datasets, and our model outperforms the deep learning models DeepCoNN, D-ATT, TransNets and MPCN.
- •
We exploit the attention distribution of U2I over the user reviews and the item reviews. We then select the most important reviews of the user and the item, i.e., those with the largest attention weights, and compare them with the U2I review. This confirms the effectiveness of guidance by the attention distribution.
The remainder of this paper is organized as follows: related works will be discussed in Section 2. Section 3 will introduce our proposed model. The experiments and evaluation will be presented in Section 4. In Section 5, we perform some ablation analysis. We conduct an attention distribution analysis in Section 6, and finally, we conclude our work in Section 7.
Related work
With the rapid development of e-commerce, recommender systems have become a popular research area. We can take advantage of the rich data collected from e-commerce platforms to build many powerful models with machine learning and deep learning techniques. Recently, researchers have found increasingly more useful data that can be thoroughly mined, such as the reviews that users write for items, user sessions recorded by backends and other user behavior when browsing e-commerce platforms.
Proposed method
In this section, we present our proposed method, the attention distribution guided information transfer network (ADGITN), whose architecture is illustrated in Figs. 1–3. Fig. 1 shows the feature extraction model (FEM), Fig. 2 shows the architecture for validation and real-world application, and Fig. 3 shows the architecture for training. Specifically, our model has two different architectures for the training and validation processes.
In validation, we should simulate real-world application …
Experiment and evaluation
In this section, we introduce the benchmark datasets, our experimental setup and the evaluation. Our experiments address the following research questions (RQs) in Sections 4 and 5:
RQ1: What is the performance of our model compared with existing models such as DeepCoNN, D-ATT, TransNet and MPCN?
RQ2: Is the design of the model effective in extracting valuable information from a U2I review to guide the learning of the main model?
Ablation analysis
In this section, we discuss the impact of key architectural features. We design six types of ablation experiments on the Yelp dataset, which are introduced as follows.
- •
NoU2I: To prove the effectiveness of attention distribution-guided information transfer, which is discussed in Section 3, and to answer RQ2, we delete the auxiliary models that correspond to U2I during training. The deleted blocks comprise two attention layers, the FEM block of U2I, two auxiliary MSE loss functions and two …
Attention distribution analysis
In this section, we discuss the attention distribution’s contribution and answer RQ3. The key idea of our work is transmitting the valuable information that the submodel extracts from the U2I review, which is available only during training, into the main model via the two attention distributions. The two attention layers learn attention distributions for the target user and the target item, respectively, and these distributions are used to weight the encoded users and encoded items, respectively.
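The analysis described here can be sketched in a few lines: given a learned attention distribution over a user's encoded reviews, the review with the largest weight is taken as the most important one and compared against the U2I review. The weights and review labels below are made up for illustration.

```python
import numpy as np

# Toy attention weights over five historical reviews of one user.
attn = np.array([0.05, 0.10, 0.60, 0.15, 0.10])
reviews = ["r0", "r1", "r2", "r3", "r4"]   # placeholder review identifiers

# The most heavily weighted review is the one compared with the U2I review.
most_important = reviews[int(np.argmax(attn))]
```

If the selected review and the U2I review discuss the same aspects of the item, this supports the claim that the attention distribution carries the U2I information.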
Conclusion
We proposed a review-based deep learning model for recommendation. Our validation process conforms to real-world application scenarios in which we cannot obtain the U2I review. To better utilize the valuable information of the U2I review, we designed four auxiliary tasks for training to extract information from the U2I review and transferred it to the main model via attention distributions. These attention distributions can be seen as two teacher signals: one for encoded users and one for encoded items.
CRediT authorship contribution statement
Gang Sun: Conceptualization, Methodology, Software, Validation. Yu Li: Formal analysis, Writing - original draft, Validation. Hongfang Yu: Supervision, Funding acquisition, Project administration. Victor Chang: Writing - review & editing, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This research was partially supported by the PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (PCL2018KP001).
References (35)
- [1] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer (2009).
- [2] R. Salakhutdinov, A. Mnih, Probabilistic matrix factorization, in: Neural Inf. Process. Syst., 2007.
- [3] C. Wang, D.M. Blei, Collaborative topic modeling for recommending scientific articles, in: The 17th ACM SIGKDD...
- [4] L. Zheng, V. Noroozi, P.S. Yu, Joint deep modeling of users and items using reviews for recommendation, in: The Tenth...
- [5] R. Catherine, W. Cohen, TransNets: Learning to transform for recommendation (2017).
- [6] S. Seo, J. Huang, H. Yang, et al., Interpretable convolutional neural networks with dual local and global attention for...
- [7] et al., Joint deep model with multi-level attention and hybrid-prediction for recommendation, Entropy (2019).
- [8] S. Rendle, Factorization machines, in: The 10th IEEE International Conference on Data Mining, 2010, pp....
- [9] Y. Tay, A.T. Luu, S.C. Hui, Multi-pointer co-attention networks for recommendation, in: The 24th ACM SIGKDD...
- [10] Y. Zhang, T. Xiang, T.M. Hospedales, et al., Deep mutual learning, in: IEEE Conference on Computer Vision and Pattern...
- [11] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network.
- [12] Deep session interest network for click-through rate prediction.
- BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer.
- J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding.
- A. Radford, et al., Language models are unsupervised multitask learners, OpenAI Blog.