4.1 Experimental Setup
4.1.1 Datasets.
To evaluate the performance of the proposed RocSE, we conduct extensive experiments on four real-world recommendation datasets: MovieLens-1M (ML-1M) [15], Gowalla [8], Yelp, and Amazon Books [30]. These datasets vary in domain, scale, and density. Following NCL [26], for Yelp and Amazon Books, we filter out users/items with fewer than 15 interactions; for Gowalla, we filter out users/items with fewer than 10 interactions. We also discard interactions with ratings lower than 3 in ML-1M, Yelp, and Amazon Books. We summarize the statistics of all datasets, including the number of users, the number of items, the number of interactions, and the density, in Table 3. For each dataset, we randomly divide the interactions into training, validation, and test sets with a ratio of 8:1:1. We uniformly sample one negative item for each positive instance in the training set.
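The splitting and negative-sampling procedure can be sketched as follows; this is an illustrative reconstruction, not the paper's actual code, and all function names are ours:

```python
import random

def split_and_sample(interactions, all_items, seed=0):
    """Randomly split user-item interactions 8:1:1 into train/valid/test,
    then pair each training interaction with one uniformly sampled
    negative item the user has not interacted with."""
    rng = random.Random(seed)
    data = interactions[:]
    rng.shuffle(data)
    n = len(data)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]

    # Index every item each user has interacted with (across all splits),
    # so negatives are guaranteed to be unobserved items.
    seen = {}
    for u, i in interactions:
        seen.setdefault(u, set()).add(i)

    train_with_neg = []
    for u, i in train:
        j = rng.choice(all_items)
        while j in seen[u]:          # resample until the item is unseen
            j = rng.choice(all_items)
        train_with_neg.append((u, i, j))  # (user, positive, negative)
    return train_with_neg, valid, test
```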
To further simulate the scenario where noisy interactions are ubiquitous in the real world, we artificially add noisy interactions to each dataset to build noisy datasets. Specifically, for each dataset, we first randomly select 10% or 20% of the user-item interactions in the training set; then, for each selected interaction \(\lt u, i\gt\), we randomly sample an item \(j\) that user \(u\) has not interacted with to create a new interaction \(\lt u, j\gt\) as a noisy interaction. We only inject noise into the training set, keeping the validation and test sets unchanged. Taking ML-1M as an example, we denote the versions with 10% and 20% injected noisy interactions as ML-1M-10% and ML-1M-20%, respectively.
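The noise-injection protocol above can be sketched as follows (a minimal reconstruction under the stated 10%/20% ratios; the function name is ours, not from the paper's code):

```python
import random

def inject_noise(train, all_items, ratio=0.1, seed=0):
    """For a `ratio` fraction of training interactions <u, i>, add a new
    noisy interaction <u, j> with a uniformly sampled item j the user has
    not interacted with. Validation/test sets are left untouched."""
    rng = random.Random(seed)
    seen = {}
    for u, i in train:
        seen.setdefault(u, set()).add(i)
    selected = rng.sample(train, int(ratio * len(train)))
    noisy = []
    for u, _ in selected:
        j = rng.choice(all_items)
        while j in seen[u]:       # resample until the item is unobserved
            j = rng.choice(all_items)
        seen[u].add(j)            # avoid injecting the same pair twice
        noisy.append((u, j))
    return train + noisy
```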
4.1.2 Compared Models.
We compare the proposed method with the following methods:
• NGCF [49] is a graph-based CF method that largely follows the standard GCN. It additionally encodes the second-order feature interaction into the message during message passing.
• LightGCN [16] devises a lightweight graph convolution without feature transformation or non-linear activation, making it simpler and more efficient.
• T-CE [47] is a state-of-the-art sample re-weighting method for robust recommendation, which uses the Truncated BCE loss to prune noisy interactions. It is originally designed for the BCE loss only, and we extend it with CDAE [55] for better performance.
• DeCA [51] is a recently proposed robust recommender, which considers the disagreement between different models' predictions on noisy samples and minimizes the KL-divergence between the two models' predictions to enhance robustness. We implement it based on LightGCN [16].
• SGL [54] uses self-supervised learning to learn a more effective and robust model; it designs different graph views to mine hard negatives and denoise noisy interactions in implicit feedback. We implement SGL-ED, the version suggested in the original paper, for comparison.
• NCL [26] is a recently proposed neural graph CF method, which considers the neighbors of users (or items) from the two aspects of graph structure and semantic space to form the views for contrastive learning. We implement it based on LightGCN.
• SimGCL [63] is another recently proposed GNN-based CF method, which utilizes contrastive learning as an auxiliary task. It builds different data augmentations by adding directed random noise to the representations and has achieved state-of-the-art performance.
Here, we mainly consider GNN-based methods as baselines for two reasons. First, neural graph collaborative filtering is mainly built upon GNNs, which is also the focus of this work. Second, compared to traditional methods such as BPRMF [38] and NeuMF [17], neural graph collaborative filtering methods have been shown to achieve better results, as they encode the high-order information of the bipartite graph into the representations [16, 49, 54].
4.1.3 Evaluation Metrics.
We evaluate the top-\(N\) recommendation performance using two widely used metrics, \(Recall@N\) and \(NDCG@N\), where \(N\) is set to 10 and 20 for consistency. Following References [16, 54], we adopt the full-ranking strategy [66], which ranks all candidate items that the user has not interacted with.
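Under binary relevance, the two metrics can be computed per user as follows (a minimal sketch; function names are ours, and `ranked` is the full ranking over the user's candidate items):

```python
import math

def recall_at_n(ranked, relevant, n):
    """Fraction of the user's relevant (test) items found in the top-N."""
    if not relevant:
        return 0.0
    return len(set(ranked[:n]) & relevant) / len(relevant)

def ndcg_at_n(ranked, relevant, n):
    """Binary-relevance NDCG@N: the DCG of the top-N ranking, normalized
    by the DCG of an ideal ranking that places all relevant items first."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:n]) if item in relevant)
    idcg = sum(1.0 / math.log2(rank + 2)
               for rank in range(min(len(relevant), n)))
    return dcg / idcg if idcg > 0 else 0.0
```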
4.1.4 Implementations.
For all the compared models, we either use the source code provided by their authors (if available) or implement them ourselves with RecBole [67], a unified open-source framework for developing and replicating recommendation algorithms. To ensure fairness, we fix the embedding size and batch size to 64 and 4,096, respectively. We optimize all models with the Adam optimizer, and the parameters are initialized with the Xavier distribution. We adopt an early stopping strategy to prevent overfitting, i.e., we stop training if the evaluation metric (e.g., NDCG@10) on the validation set does not increase for 10 epochs. For each compared method, we start from the best hyper-parameters reported in its paper and then fine-tune them carefully to achieve the best results we can obtain (note that most of the compared methods perform even better after this step).
For the proposed RocSE, we fix \(\lambda _{2}\) to \(10^{-4}\), the same as for the compared models. We tune the hyper-parameter \(\tau\) in \(\lbrace 0.05, 0.1, 0.2, 0.5, 1.0\rbrace\), \(\theta\) in \(\lbrace 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\rbrace\), \(\lambda _{1}\) in \(\lbrace 0.1, 0.2, 0.5, 1.0, 2.0, 5.0 \rbrace\), and \(\epsilon\) in \(\lbrace 0.01, 0.05, 0.1, 0.2, 0.5, 1.0 \rbrace\) on the validation set.
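An exhaustive search over these candidate grids can be sketched as follows; the `evaluate` callback, which would train RocSE with a given configuration and return the validation metric, is a hypothetical stand-in for the actual training loop:

```python
from itertools import product

# Candidate values taken from the text above.
TAU     = [0.05, 0.1, 0.2, 0.5, 1.0]
THETA   = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
LAMBDA1 = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
EPSILON = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]

def grid_search(evaluate):
    """Return the (tau, theta, lambda1, epsilon) configuration that
    maximizes the validation metric returned by `evaluate`."""
    best_score, best_cfg = float("-inf"), None
    for cfg in product(TAU, THETA, LAMBDA1, EPSILON):
        score = evaluate(*cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg
```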
4.2 Overall Performance
We compare the overall performance of all methods on different datasets. The comparison results on the clean datasets (the original datasets without additional noisy interactions) and on the noisy datasets with 10% and 20% noisy interactions are shown in Tables 4, 5, and 6, respectively.
First, from Table 4, we can observe that RocSE achieves the best performance on all clean datasets and outperforms the state-of-the-art method SimGCL by a noticeable margin on the Amazon Books dataset (up to 8.25% relative improvement). Compared with NGCF and LightGCN, DeCA performs comparably, and T-CE performs better on the Yelp and Amazon datasets but drops on the other two. The other three methods, SGL, NCL, and SimGCL, all utilize contrastive learning as an auxiliary task, which yields a significant performance improvement. These three methods construct the contrastive views in different ways: SGL augments different interaction graphs via edge dropping; NCL constructs views using structural neighbors and semantic neighbors in the interaction graph; SimGCL builds different views by adding random noise to the representations. These strategies also achieve different effects on different datasets. For example, NCL performs considerably well on ML-1M and Yelp, but drops on Amazon. A possible reason is that NCL is more suitable for datasets with denser interactions, where user (item) neighbors are more informative. SGL brings a good improvement on each dataset over the LightGCN baseline, and the overall performance of SimGCL is the best among the three. The probable reason is that constructing a contrastive view by adding random noise in the embedding space makes the representations of users and items more uniform. It is worth noting that the second part (embedding-space perturbation) of our work is similar to SimGCL, which also introduces noise into the representations. The main difference of our method is that it mimics attacking behavior by using the existing user/item embeddings as perturbations and further considers denoising in the structure space.
Second, Tables 5 and 6 show the results where extra noisy interactions are injected into the training data. We can first observe that RocSE still achieves improvements on all datasets compared to the existing neural graph collaborative filtering methods. For example, on the Amazon dataset, RocSE improves over the best competitor (SimGCL) by up to 14.32% and 19.61% on the NDCG@10 metric when 10% and 20% noise is randomly injected, respectively. Additionally, RocSE achieves higher relative improvements in nearly all cases when noise is injected than in the clean setting, and the relative improvements are further enlarged with more noise. This means that RocSE is more robust than the existing methods when there is more noise in the data. Moreover, observing the performance of RocSE across the datasets, we find that RocSE tends to be more effective on sparser and larger datasets, which are more common in the real world. We conjecture two possible reasons for this observation. First, denser datasets contain more reliable information, making them more robust against noise. This is also supported by the observation that all methods perform relatively better on the ML-1M dataset, which is much denser (at least around \(30\times\) denser than the other three datasets). Second, on a large and sparse interaction graph, noisy interactions may have a greater impact on the neighbor nodes; in the extreme case where most of the edges of a given node are noisy interactions, it would be extremely difficult to make correct predictions for that node. As for the existing methods, we can also observe that contrastive learning mitigates the effect of noise to a certain extent. For example, SimGCL, NCL, and SGL all outperform LightGCN by a relatively large margin. Among them, SimGCL also seems to be more effective under noise interference than SGL and NCL. But still, they are all less effective than RocSE.
To show the negative impact of noisy interactions more intuitively, we compare the performance degradation of all methods after adding noise. In Table 7, we summarize the drop points and drop rates of the NDCG@10 metric for all compared methods after adding 10% and 20% noisy interactions. By comparing the results, we can easily find that almost all methods, especially LightGCN and NGCF, exhibit a cliff-like decline in performance as additional noisy interactions increase. Specifically, when adding 20% noisy interactions, both NGCF and LightGCN suffer a performance drop of more than 20% on Yelp, Amazon Books, and Gowalla. An important reason may be that the message-passing mechanism of GNNs exacerbates the negative effects of noisy interactions, which is also why we need to pay close attention to the robustness of GNN-based collaborative filtering. With its dual denoising scheme, our proposed method keeps the performance degradation below 3% on all datasets, which is far better than the other methods on the latter three datasets. T-CE performs relatively well in terms of performance degradation on the ML-1M dataset. The probable reason is that the backbone CDAE [55] used by T-CE is more stable than LightGCN on dense datasets (e.g., ML-1M). Still, the performance of T-CE drops significantly on sparse datasets.
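The two degradation measures reported in Table 7 are simple to compute (a sketch; the function name is ours):

```python
def degradation(clean_metric, noisy_metric):
    """Drop points (absolute decrease) and drop rate (relative decrease)
    of a metric, e.g. NDCG@10, after noise injection."""
    drop = clean_metric - noisy_metric
    return drop, drop / clean_metric
```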
We next examine whether RocSE can truly identify noisy interactions. For this purpose, we test whether the trained model can successfully distinguish noisy interaction samples by calculating the prediction scores (i.e., Equation (6)) for noisy and clean interactions separately. The prediction score of an interaction reflects how well the model fits that interaction; a higher prediction score indicates that the model is more confident that the corresponding interaction is a clean sample. The results of LightGCN and RocSE on the datasets with 20% noisy interactions are shown in Figure 3, where the blue and orange boxes represent clean and noisy samples, respectively. As can be seen from the figure, compared with LightGCN, the prediction scores that RocSE assigns to clean and noisy samples differ more clearly, which indicates that our method can distinguish noisy samples from clean samples more effectively. For example, on the Amazon-20% dataset, the prediction scores of our method for the injected noisy interactions stay in a very low range, which is consistent with the excellent performance of RocSE in Table 6.
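This separation analysis can be sketched as follows. We assume here that the prediction score of Equation (6) is the inner product of the final user and item embeddings, as is standard in LightGCN-style models; the function name and the median-gap summary are ours:

```python
import numpy as np

def score_separation(user_emb, item_emb, clean_pairs, noisy_pairs):
    """Gap between the median prediction scores of clean and noisy
    interactions; a larger gap means the model separates noise better."""
    def scores(pairs):
        return np.array([user_emb[u] @ item_emb[i] for u, i in pairs])
    return float(np.median(scores(clean_pairs)) -
                 np.median(scores(noisy_pairs)))
```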
In summary, to answer RQ1, the proposed RocSE outperforms the state-of-the-art neural graph collaborative filtering methods, and the improvements are further enlarged when noisy interactions are injected into the training data. Furthermore, RocSE performs especially well compared to the existing methods when the datasets are large and sparse, which is the usual case in practice.
4.5 Efficiency
Finally, we evaluate the efficiency of the proposed method. We compare the actual training time of the proposed RocSE with LightGCN and SGL. The results are shown in Table 10, where all the results are collected on an Intel(R) Xeon(R) Silver 4110 CPU and a GeForce RTX 2080 GPU.
As shown in Table 10, we report the training time of a single epoch, the number of epochs to converge under the same early stopping strategy, and the total training time for all methods. We also report RocSE's training time as a multiple of LightGCN's. We can observe from the table that the training time per epoch of our method is even reduced compared to SGL. The reason is that our method does not rely on graph structure augmentation to construct contrastive learning views. Compared with LightGCN, although each training epoch takes about three times as long, RocSE only incurs 42% and 21% extra total training time on the larger Yelp-20% and Amazon-20% datasets, respectively. This is mainly due to the faster convergence of our method. Considering the benefits in effectiveness and robustness, such extra computational cost is affordable in practice.
In summary, to answer RQ4, although additional denoising and perturbation modules are adopted, the proposed RocSE only incurs affordable extra computational cost compared with LightGCN. It even runs faster than SGL.