4.1 Experimental Setup
4.1.1 Datasets.
To evaluate the performance of the proposed RocSE, we conduct extensive experiments on four real-world recommendation datasets: MovieLens-1M (ML-1M) [15], Gowalla [8], Yelp, and Amazon Books [30]. These datasets vary in domain, scale, and density. Following NCL [26], for Yelp and Amazon Books, we filter out users/items with fewer than 15 interactions; for Gowalla, we filter out users/items with fewer than 10 interactions. We also discard interactions with ratings lower than 3 in ML-1M, Yelp, and Amazon Books. We summarize the statistics of all datasets, including the number of users, the number of items, the number of interactions, and the density, in Table 3. For each dataset, we randomly divide the interactions into training, validation, and test sets with a ratio of 8:1:1. We uniformly sample one negative item for each positive instance in the training set.
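The splitting and negative-sampling procedure can be sketched as follows; this is an illustrative reconstruction, not the paper's actual code, and all function names are ours:

```python
import random

def split_and_sample(interactions, all_items, seed=0):
    """Randomly split user-item interactions 8:1:1 into train/valid/test,
    then pair each training interaction with one uniformly sampled
    negative item the user has not interacted with."""
    rng = random.Random(seed)
    data = interactions[:]
    rng.shuffle(data)
    n = len(data)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]

    # Index every item each user has interacted with (across all splits),
    # so negatives are guaranteed to be unobserved items.
    seen = {}
    for u, i in interactions:
        seen.setdefault(u, set()).add(i)

    train_with_neg = []
    for u, i in train:
        j = rng.choice(all_items)
        while j in seen[u]:          # resample until the item is unseen
            j = rng.choice(all_items)
        train_with_neg.append((u, i, j))  # (user, positive, negative)
    return train_with_neg, valid, test
```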
To further simulate the scenario where noisy interactions are ubiquitous in the real world, we artificially add noisy interactions to each dataset to build noisy datasets. Specifically, for each dataset, we first randomly select 10% or 20% of the user-item interactions in the training set; then, for each selected interaction \(\lt u, i\gt\), we randomly sample an item \(j\) that user \(u\) has not interacted with to create a new interaction \(\lt u, j\gt\) as a noisy interaction. We only inject noise into the training set, keeping the validation and test sets unchanged. Taking ML-1M as an example, we denote the versions with 10% and 20% injected noisy interactions as ML-1M-10% and ML-1M-20%, respectively.
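The noise-injection protocol above can be sketched as follows (a minimal reconstruction under the stated 10%/20% ratios; the function name is ours, not from the paper's code):

```python
import random

def inject_noise(train, all_items, ratio=0.1, seed=0):
    """For a `ratio` fraction of training interactions <u, i>, add a new
    noisy interaction <u, j> with a uniformly sampled item j the user has
    not interacted with. Validation/test sets are left untouched."""
    rng = random.Random(seed)
    seen = {}
    for u, i in train:
        seen.setdefault(u, set()).add(i)
    selected = rng.sample(train, int(ratio * len(train)))
    noisy = []
    for u, _ in selected:
        j = rng.choice(all_items)
        while j in seen[u]:       # resample until the item is unobserved
            j = rng.choice(all_items)
        seen[u].add(j)            # avoid injecting the same pair twice
        noisy.append((u, j))
    return train + noisy
```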
4.1.2 Compared Models.
We compare the proposed method with the following methods:
• NGCF [49] is a graph-based CF method that largely follows the standard GCN. It additionally encodes the second-order feature interaction into the message during message passing.
• LightGCN [16] devises a lightweight graph convolution without feature transformation or non-linear activation, making it simpler and more efficient.
• T-CE [47] is a state-of-the-art sample re-weighting method for robust recommendation, which uses the Truncated BCE loss to prune noisy interactions. It is originally designed for the BCE loss only, and we extend it with CDAE [55] for better performance.
• DeCA [51] is a recently proposed robust recommender, which considers the disagreement between different models' predictions on noisy samples and minimizes the KL-divergence between the two models' predictions to enhance robustness. We implement it based on LightGCN [16].
• SGL [54] uses self-supervised learning to learn a more effective and robust model; it designs different graph views to mine hard negatives and denoise noisy interactions in implicit feedback. We implement SGL-ED, the version suggested in the original paper, for comparison.
• NCL [26] is a recently proposed neural graph CF method, which considers the neighbors of users (or items) from the two aspects of graph structure and semantic space to form the views for contrastive learning. We implement it based on LightGCN.
• SimGCL [63] is another recently proposed GNN-based CF method, which utilizes contrastive learning as an auxiliary task. It builds different data augmentations by adding directed random noise to the representations and has achieved state-of-the-art performance.
Here, we mainly consider GNN-based methods as baselines for two reasons. First, neural graph collaborative filtering is mainly built upon GNNs, which is also the focus of this work. Second, compared to traditional methods such as BPRMF [38] and NeuMF [17], neural graph collaborative filtering methods have been shown to achieve better results, as they encode the high-order information of the bipartite graph into the representations [16, 49, 54].
4.1.3 Evaluation Metrics.
We evaluate the top-\(N\) recommendation performance using two widely used metrics, \(Recall@N\) and \(NDCG@N\), where \(N\) is set to 10 and 20 for consistency. Following References [16, 54], we adopt the full-ranking strategy [66], which ranks all candidate items that the user has not interacted with.
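Under binary relevance, the two metrics can be computed per user as follows (a minimal sketch; function names are ours, and `ranked` is the full ranking over the user's candidate items):

```python
import math

def recall_at_n(ranked, relevant, n):
    """Fraction of the user's relevant (test) items found in the top-N."""
    if not relevant:
        return 0.0
    return len(set(ranked[:n]) & relevant) / len(relevant)

def ndcg_at_n(ranked, relevant, n):
    """Binary-relevance NDCG@N: the DCG of the top-N ranking, normalized
    by the DCG of an ideal ranking that places all relevant items first."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:n]) if item in relevant)
    idcg = sum(1.0 / math.log2(rank + 2)
               for rank in range(min(len(relevant), n)))
    return dcg / idcg if idcg > 0 else 0.0
```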
4.1.4 Implementations.
For all the compared models, we either use the source code provided by their authors (if available) or implement them ourselves with RecBole [67], a unified open-source framework for developing and replicating recommendation algorithms. To ensure fairness, we fix the embedding size and batch size to 64 and 4,096, respectively. We optimize all models with the Adam optimizer, and the parameters are initialized with the Xavier distribution. We adopt an early stopping strategy to prevent overfitting, i.e., we stop training if the evaluation metric (e.g., NDCG@10) on the validation set does not increase for 10 epochs. For each compared method, we start from the best hyper-parameters reported in its paper and then fine-tune them carefully to achieve the best results we can obtain (note that most of the compared methods perform even better after this step).
For the proposed RocSE, we fix \(\lambda _{2}\) to \(10^{-4}\), the same as for the compared models. We tune the hyper-parameter \(\tau\) in \(\lbrace 0.05, 0.1, 0.2, 0.5, 1.0\rbrace\), \(\theta\) in \(\lbrace 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\rbrace\), \(\lambda _{1}\) in \(\lbrace 0.1, 0.2, 0.5, 1.0, 2.0, 5.0 \rbrace\), and \(\epsilon\) in \(\lbrace 0.01, 0.05, 0.1, 0.2, 0.5, 1.0 \rbrace\) on the validation set.
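An exhaustive search over these candidate grids can be sketched as follows; the `evaluate` callback, which would train RocSE with a given configuration and return the validation metric, is a hypothetical stand-in for the actual training loop:

```python
from itertools import product

# Candidate values taken from the text above.
TAU     = [0.05, 0.1, 0.2, 0.5, 1.0]
THETA   = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
LAMBDA1 = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
EPSILON = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]

def grid_search(evaluate):
    """Return the (tau, theta, lambda1, epsilon) configuration that
    maximizes the validation metric returned by `evaluate`."""
    best_score, best_cfg = float("-inf"), None
    for cfg in product(TAU, THETA, LAMBDA1, EPSILON):
        score = evaluate(*cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg
```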
4.2 Overall Performance
We compare the overall performance of all methods on different datasets. The comparison results on the clean datasets (the original datasets without additional noisy interactions) and on the noisy datasets with 10% and 20% noisy interactions are shown in Tables 4, 5, and 6, respectively.
First, from Table 4, we can observe that RocSE achieves the best performance on all clean datasets and outperforms the state-of-the-art method SimGCL by a noticeable margin on the Amazon Books dataset (up to 8.25% relative improvement). Compared with NGCF and LightGCN, DeCA performs comparably, and T-CE performs better on the Yelp and Amazon datasets but drops on the other two. The other three methods, SGL, NCL, and SimGCL, all utilize contrastive learning as an auxiliary task, which yields a significant performance improvement. These three methods construct the contrastive views in different ways: SGL augments different interaction graphs via edge dropping; NCL constructs views using structural neighbors and semantic neighbors in the interaction graph; SimGCL builds different views by adding random noise to the representations. These strategies also achieve different effects on different datasets. For example, NCL performs considerably well on ML-1M and Yelp, but drops on Amazon. A possible reason is that NCL is more suitable for datasets with denser interactions, where user (item) neighbors are more informative. SGL brings a good improvement on each dataset over the LightGCN baseline, and the overall performance of SimGCL is the best among the three. The probable reason is that constructing a contrastive view by adding random noise in the embedding space makes the representations of users and items more uniform. It is worth noting that the second part (embedding-space perturbation) of our work is similar to SimGCL, which also introduces noise into the representations. The main difference of our method is that it mimics attacking behavior by using the existing user/item embeddings as perturbations and further considers denoising in the structure space.
Second, Tables 5 and 6 show the results where extra noisy interactions are injected into the training data. We can first observe that RocSE still achieves improvements on all datasets compared to the existing neural graph collaborative filtering methods. For example, on the Amazon dataset, RocSE improves over the best competitor (SimGCL) by up to 14.32% and 19.61% on the NDCG@10 metric when 10% and 20% noise is randomly injected, respectively. Additionally, RocSE achieves higher relative improvements in nearly all cases when noise is injected than in the clean setting, and the relative improvements are further enlarged with more noise. This means that RocSE is more robust than the existing methods when there is more noise in the data. Moreover, observing the performance of RocSE across the datasets, we find that RocSE tends to be more effective on sparser and larger datasets, which are more common in the real world. We conjecture two possible reasons for this observation. First, denser datasets contain more reliable information, making them more robust against noise. This is also supported by the observation that all methods perform relatively better on the ML-1M dataset, which is much denser (at least around \(30\times\) denser than the other three datasets). Second, on a large and sparse interaction graph, noisy interactions may have a greater impact on the neighbor nodes; in the extreme case where most of the edges of a given node are noisy interactions, it would be extremely difficult to make correct predictions for that node. As for the existing methods, we can also observe that contrastive learning mitigates the effect of noise to a certain extent. For example, SimGCL, NCL, and SGL all outperform LightGCN by a relatively large margin. Among them, SimGCL also seems to be more effective under noise interference than SGL and NCL. But still, they are all less effective than RocSE.
To show the negative impact of noisy interactions more intuitively, we compare the performance degradation of all methods after adding noise. In Table 7, we summarize the drop points and drop rates of the NDCG@10 metric for all compared methods after adding 10% and 20% noisy interactions. By comparing the results, we can easily find that almost all methods, especially LightGCN and NGCF, exhibit a cliff-like decline in performance as additional noisy interactions increase. Specifically, when adding 20% noisy interactions, both NGCF and LightGCN suffer a performance drop of more than 20% on Yelp, Amazon Books, and Gowalla. An important reason may be that the message-passing mechanism of GNNs exacerbates the negative effects of noisy interactions, which is also why we need to pay close attention to the robustness of GNN-based collaborative filtering. With its dual denoising scheme, our proposed method keeps the performance degradation below 3% on all datasets, which is far better than the other methods on the latter three datasets. T-CE performs relatively well in terms of performance degradation on the ML-1M dataset. The probable reason is that the backbone CDAE [55] used by T-CE is more stable than LightGCN on dense datasets (e.g., ML-1M). Still, the performance of T-CE drops significantly on sparse datasets.
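The two degradation measures reported in Table 7 are simple to compute (a sketch; the function name is ours):

```python
def degradation(clean_metric, noisy_metric):
    """Drop points (absolute decrease) and drop rate (relative decrease)
    of a metric, e.g. NDCG@10, after noise injection."""
    drop = clean_metric - noisy_metric
    return drop, drop / clean_metric
```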
We next examine whether RocSE can truly identify noisy interactions. For this purpose, we test whether the trained model can successfully distinguish noisy interaction samples by calculating the prediction scores (i.e., Equation (6)) for noisy and clean interactions separately. The prediction score of an interaction reflects how well the model fits that interaction; a higher prediction score indicates that the model is more confident that the corresponding interaction is a clean sample. The results of LightGCN and RocSE on the datasets with 20% noisy interactions are shown in Figure 3, where the blue and orange boxes represent clean and noisy samples, respectively. As can be seen from the figure, compared with LightGCN, the prediction scores that RocSE assigns to clean and noisy samples differ more clearly, which indicates that our method can distinguish noisy samples from clean samples more effectively. For example, on the Amazon-20% dataset, the prediction scores of our method for the injected noisy interactions stay in a very low range, which is consistent with the excellent performance of RocSE in Table 6.
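This separation analysis can be sketched as follows. We assume here that the prediction score of Equation (6) is the inner product of the final user and item embeddings, as is standard in LightGCN-style models; the function name and the median-gap summary are ours:

```python
import numpy as np

def score_separation(user_emb, item_emb, clean_pairs, noisy_pairs):
    """Gap between the median prediction scores of clean and noisy
    interactions; a larger gap means the model separates noise better."""
    def scores(pairs):
        return np.array([user_emb[u] @ item_emb[i] for u, i in pairs])
    return float(np.median(scores(clean_pairs)) -
                 np.median(scores(noisy_pairs)))
```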
In summary, to answer RQ1, the proposed RocSE outperforms the state-of-the-art neural graph collaborative filtering methods, and the improvements are further enlarged when noisy interactions are injected into the training data. Furthermore, RocSE performs especially well compared to the existing methods when the datasets are large and sparse, which is the usual case in practice.
4.5 Efficiency
Finally, we evaluate the efficiency of the proposed method. We compare the actual training time of the proposed RocSE with LightGCN and SGL. The results are shown in Table 10, where all the results are collected on an Intel(R) Xeon(R) Silver 4110 CPU and a GeForce RTX 2080 GPU.
As shown in Table 10, we report the training time of a single epoch, the number of epochs to converge under the same early stopping strategy, and the total training time for all methods. We also report RocSE's training time as a multiple of LightGCN's. We can observe from the table that the training time per epoch of our method is even reduced compared to SGL. The reason is that our method does not rely on graph structure augmentation to construct contrastive learning views. Compared with LightGCN, although each training epoch takes about three times as long, RocSE only incurs 42% and 21% extra total training time on the larger Yelp-20% and Amazon-20% datasets, respectively. This is mainly due to the faster convergence of our method. Considering the benefits in effectiveness and robustness, such extra computational cost is affordable in practice.
In summary, to answer RQ4, although additional denoising and perturbation modules are adopted, the proposed RocSE only incurs affordable extra computational cost compared with LightGCN. It even runs faster than SGL.