1 Introduction

Graph neural networks (GNNs) have achieved great success in modeling diverse graph data types, such as citation networks (Kipf and Welling 2017; Vaswani et al. 2017; Klicpera et al. 2019a), social networks (Hamilton et al. 2017; Chen et al. 2018; Rossi et al. 2020), and biological graphs (Fout et al. 2017; Li et al. 2019, 2018, 2021). Fundamental tasks addressed by GNNs include semisupervised node classification (Defferrard et al. 2016; Kipf and Welling 2017; Hamilton et al. 2017; Vaswani et al. 2017; Li et al. 2019, 2020; He et al. 2021; Li et al. 2021; Chen et al. 2020; Yu et al. 2022) and link prediction (Kipf and Welling 2016; Wang et al. 2021).

In semisupervised node classification, GNNs predict the classes of unlabeled nodes by using a limited set of labeled nodes together with neighborhood information. Link prediction, in contrast, infers missing or potential edges in incomplete graphs and can also serve as a self-supervised learning paradigm when performed with masked edges. Despite its potential, self-supervised graph learning (Kipf and Welling 2016; Pan et al. 2018; Hou et al. 2022) often falls behind supervised methods when labels are available.

GNNs often incorporate successful techniques from other areas. For instance, attention mechanisms from transformer models (Vaswani et al. 2017) are widely employed. Graph attention networks (GATs) (Velickovic et al. 2018) utilize attention scores to weigh the importance of neighbors. However, GAT still suffers from the oversmoothing issue (Li et al. 2018), a significant challenge to deep GNNs. Techniques from computer vision, such as convolution mechanisms (LeCun et al. 1998), have also been applied to GNNs for adaptive multihop feature aggregation (Zhang et al. 2018; Chien et al. 2021).

The oversmoothing problem, a common issue with deep GNNs, arises when multiple layers are stacked to capture complex neighborhood information, causing node features to become indistinguishable. This issue is often attributed to the nature of GNNs, which preserve the components associated with low eigenvalues, corresponding to smoother graph signals, while attenuating the others. Consequently, as illustrated in Fig. 1, high-frequency components are suppressed, yielding a more uniform output. Over iterations, the lowest-frequency component increasingly dominates the output, causing oversmoothing.

Fig. 1: Applying a low-pass filter (\(g(\lambda )=(1-0.5\lambda )^5\)) to a ring graph. Such a filter can be considered a deep 5-layer GNN. In the filtered graph signal, the ratio of high-frequency components becomes significantly lower

To alleviate the oversmoothing issue pervasive in deep GNNs, various solutions have been proposed. However, these techniques often introduce additional complexity, oversimplify the architecture, or rely heavily on predefined parameters. Residual GNNs (Li et al. 2018, 2019, 2020; Chen et al. 2020; Li et al. 2021) incorporate residual connections, preserving shallow information in each layer to counteract oversmoothing. However, this approach makes the layer count grow linearly with the receptive field radius, demanding a more flexible methodology to reduce redundancy and runtime.

Decoupled GNNs (Klicpera et al. 2019a; Rossi et al. 2020; Liu et al. 2020; Chien et al. 2021) employ a two-part architecture including message-passing and feature transformation to capture multihop information and to mitigate oversmoothing. However, the simplicity of this structure can limit the capacity of the model and its compatibility with other methods (Chen et al. 2020; Wang 2021; Li et al. 2021).

In contrast, diffusion-based GNNs (Defferrard et al. 2016; Klicpera et al. 2019b; Du et al. 2017; Wang et al. 2021; He et al. 2021; Bianchi et al. 2021) incorporate a graph diffusion step at each layer, which performs multiple message-passing steps and integrates the intermediate results. The graph diffusion considered in this paper is modeled directly, without probabilistic elements, and thus differs from the diffusion models in computer vision that are typically used for image noise reduction or smoothing. Despite having fewer layers, diffusion-based GNNs maintain deep receptive fields and offer better compatibility with many key GNN techniques (Chen et al. 2020; Wang 2021; Chien et al. 2022). These models may employ either explicit (Klicpera et al. 2019b) or implicit (Defferrard et al. 2016; Du et al. 2017; Wang et al. 2021; He et al. 2021) diffusion matrices, with the latter termed graph diffusion networks (GDNs). However, many GDNs utilize either predefined or naively learnable weights, which hinders their ability to capture the various patterns within deep neighborhoods.

In this paper, we introduce adaptive graph diffusion networks (AGDNs), a compact and expressive class of GNNs with large receptive fields. We note that the oversmoothing of conventional GNNs can be explained by the fact that they always act as low-pass filters in the spectral domain. The novelty of AGDNs therefore lies in their ability to learn arbitrary filters in the spectral domain, which overcomes the oversmoothing issue and further improves expressiveness. To achieve this ability with a limited number of parameters, we introduce graph diffusion with hopwise attention (HA) and hopwise convolution (HC) to adaptively gather multihop information. We also incorporate positional embeddings (PEs) to strengthen this ability for HA. We validate this ability with experiments on learning various filters on images. We also evaluate AGDNs on diverse and challenging open graph benchmark (OGB) (Hu et al. 2020) datasets under semisupervised node classification and link prediction tasks, and our results show that AGDNs outperform state-of-the-art (SOTA) GNNs while maintaining moderate complexity and runtime. The main contributions of this paper are as follows:

  • Introducing AGDNs, a compact and expressive class of GNNs with large receptive fields.

  • Addressing the oversmoothing problem with HC and with HA aided by PEs.

  • Validating the theoretical ability of AGDNs with experiments on learning various filters.

  • Outperforming SOTA GNNs on diverse OGB datasets while maintaining moderate complexity and runtime.

2 Preliminaries

In this section, we summarize the important symbols in Table 1, providing concise definitions to ensure clarity and consistency throughout the manuscript. We consider an undirected graph \({\mathcal {G}}=({\mathcal {V}}, {\mathcal {E}})\) with the node set \(\mathcal {V}\) and the edge set \({\mathcal {E}}\). We denote the number of nodes by \(N = |\mathcal {V} |\) and the number of edges by \(E = |\mathcal {E} |\). The adjacency matrix, \(\varvec{A} \in \mathbb {R}^{N \times N}\), is assumed to be non-negative and symmetric. Given the symmetry of \(\varvec{A}\), the degree matrix \(\varvec{D}\), representing both in-degrees and out-degrees, is computed by summing the elements of each row or column of the adjacency matrix.

The normalized adjacency matrix, or transition matrix, is denoted by \(\overline{\varvec{A}} \in \mathbb {R}^{N \times N}\). The graph convolution is typically represented by left-multiplication with the normalized adjacency matrix rather than the standard adjacency matrix. Normalization is achieved by scaling the matrix elements through division by the node degrees, in the form of multiplication with the inverse degree matrix \(\varvec{D}^{-1}\). Various types of normalization are summarized in Table 2. This normalization facilitates comparisons across graphs with varying sizes and degree distributions. Additionally, it ensures that the largest absolute eigenvalue of \(\overline{\varvec{A}}\) is equal to 1, preventing numerical expansion during iterative graph convolutions.
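To make the normalization concrete, the following minimal NumPy sketch computes the row (random-walk) and symmetric normalizations summarized in Table 2 for a small dense adjacency matrix; the function and variable names are illustrative and not taken from any official implementation.

```python
import numpy as np

def transition_matrices(A: np.ndarray):
    """Illustrative sketch of common adjacency normalizations (cf. Table 2).

    A is assumed to be a dense, non-negative, symmetric adjacency matrix
    without isolated nodes (isolated nodes would require special handling).
    """
    deg = A.sum(axis=1)                       # node degrees
    D_inv = np.diag(1.0 / deg)                # D^{-1}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}

    A_row = D_inv @ A                         # row (random-walk) normalization
    A_sym = D_inv_sqrt @ A @ D_inv_sqrt       # symmetric (GCN-style) normalization
    return A_row, A_sym

# Toy example: a triangle graph.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
A_row, A_sym = transition_matrices(A)
# The largest absolute eigenvalue of the normalized matrix is 1.
print(np.max(np.abs(np.linalg.eigvals(A_sym))))  # ~1.0
```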

We denote the raw node feature matrix by \({\varvec{X}}=[\varvec{x}_1,{\varvec{x}}_2,\cdots ,{\varvec{x}}_N]^{\top } \in {\mathbb R}^{N\times d^{(0)}}\), where \(d^{(0)}\) is the node feature dimension and \({\varvec{x}}_i\in {\mathbb {R}}^{d^{(0)}\times 1}\) is the feature vector of node i.

We describe a common L-layer GNN model \(f({\varvec{X}}, {\varvec{A}})\) that takes a node feature matrix and an adjacency matrix as input. This model stacks several layers \(\varvec{H}^{(l)}=g^{(l)}({\varvec{H}}^{(l-1)}, {\varvec{A}})\in \mathbb R^{N\times d^{(l)}}\) (\(1\le l\le L\)), including nonlinear activations. We denote the intermediate node representation matrices by \({\varvec{H}}^{(l)}\) (\(0\le l\le L\) and \(\varvec{H}^{(0)}={\varvec{X}}\)). \({\varvec{H}}^{(l)}\) can be written in terms of node representation vectors as \(\varvec{H}^{(l)}=\left[ {\varvec{h}}^{(l)}_{1}, \varvec{h}^{(l)}_{2},...,{\varvec{h}}^{(l)}_{N} \right] ^{\top }\) with \({\varvec{h}}_i^{(l)}\in {\mathbb {R}}^{d^{(l)}\times 1}\). The transition matrix is typically used in each layer and can be layerwise and learnable. For node classification, the softmax function is used at the output of \(f({\varvec{X}}, {\varvec{A}})\). We denote the nonlinear activation by \(\sigma\) and the LeakyReLU activation by \(\sigma _{leak}\).

2.1 Preparation for spectral analysis

The Laplacian matrix, defined as \(\varvec{L} = \varvec{D} - \varvec{A}\), is fundamental to spectral analysis, capturing graph properties through its eigenvalues and eigenvectors. In practice, the normalized Laplacian matrix \(\overline{\varvec{L}}\) is more widely used. Since it is not directly involved in iterative graph convolutions, where repeated multiplications could cause numerical expansion, its eigenvalues need not be restricted to the \([-1, 1]\) range required of the normalized adjacency matrix; instead, they are confined to [0, 2] (Chung 1996; Von Luxburg 2007), which still allows effective comparisons across graphs with different sizes and degree distributions (Chung 1996). Its computation follows that of the normalized adjacency matrix: all normalization approaches outlined in Table 2 yield the same relation between the normalized Laplacian and the normalized adjacency matrix, \(\overline{\varvec{L}} = \varvec{I} - \overline{\varvec{A}}\), where the identity matrix plays the role of the degree matrix \(\varvec{D}\) in the unnormalized definition. The associated eigendecomposition is \(\overline{{\varvec{L}}}={{\varvec{U}}} \varvec{\Lambda }{\varvec{U}}^{\top }\), where \({\varvec{U}}=[{\varvec{u}}_1, {\varvec{u}}_2,\cdots ,{\varvec{u}}_N]\) is the orthonormal eigenvector matrix and \(\varvec{\Lambda }=\text {diag}(\lambda _1, \lambda _2,\cdots ,\lambda _N)\) is the diagonal eigenvalue matrix; \({\varvec{u}}_i\) is the i-th eigenvector of \(\overline{{\varvec{L}}}\), and \(\lambda _i\in [0,2]\) is the corresponding eigenvalue. A graph signal \({\varvec{X}}\in \mathbb R^{N\times d}\) is transformed into the spectral domain by \(\widetilde{{\varvec{X}}}={\varvec{U}}^{\top }{\varvec{X}}\) and recovered by \({\varvec{X}}={\varvec{U}} \widetilde{{\varvec{X}}}\). Diffusion-based GNNs and some decoupled GNNs can be viewed as polynomial filters in the spectral domain. For example, \(f({\varvec{X}};\overline{\varvec{L}})=\sum _{k=0}^{K}\theta _{k}\overline{{\varvec{L}}}^k \varvec{X}\) can be written as \({\varvec{U}} g(\varvec{\Lambda })\widetilde{ {\varvec{X}}}\), where \(g(\varvec{\Lambda })=\sum _{k=0}^{K}\theta _{k}\varvec{\Lambda }^k\) and the coefficients \(\theta _k\) can be predefined or learnable. Alternatively, the form \(g(\varvec{\Lambda })=\sum _{k=0}^{K}\theta _{k}(\varvec{I}-\varvec{\Lambda })^k\) is more typically used, which matches the form of graph diffusion. The graph signal of the c-th channel, \({\varvec{X}}_{:,c}\), can be decomposed into frequency components as \({\varvec{X}}_{:,c}=\sum _{i=1}^{N} \alpha _{i,c} {{\varvec{u}}}_i\) with coefficients \(\alpha _{i,c}\); since the bases \({{\varvec{u}}}_i\) are orthonormal, \(\widetilde{X}_{i,c}=\alpha _{i,c}\).
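As a sanity check on the correspondence between the spatial and spectral views, the following NumPy sketch applies a polynomial filter \(g(\lambda )=\sum _k\theta _k(1-\lambda )^k\) via an explicit eigendecomposition and confirms that it matches the spatial-domain diffusion \(\sum _k\theta _k\overline{\varvec{A}}^k\varvec{X}\) on a small random graph; the dense eigendecomposition and the toy graph are for illustration only.

```python
import numpy as np

def polynomial_filter(A_norm: np.ndarray, X: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Apply U g(Λ) Uᵀ X with g(λ) = Σ_k θ_k (1 - λ)^k, where Λ, U come from
    the normalized Laplacian L̄ = I - Ā. The dense eigendecomposition is only
    for illustration; it is never needed for actual message passing."""
    N = A_norm.shape[0]
    L_norm = np.eye(N) - A_norm                  # normalized Laplacian
    lam, U = np.linalg.eigh(L_norm)              # eigenvalues lie in [0, 2]
    X_tilde = U.T @ X                            # graph Fourier transform
    g_lam = sum(t * (1.0 - lam) ** k for k, t in enumerate(theta))
    return U @ (g_lam[:, None] * X_tilde)        # filter, then inverse transform

# Sanity check against the spatial-domain diffusion Σ_k θ_k Ā^k X.
rng = np.random.default_rng(0)
A = np.triu(rng.integers(0, 2, size=(6, 6)).astype(float), 1)
A = A + A.T                                      # random symmetric adjacency
A = A + np.diag((A.sum(1) == 0).astype(float))   # self-loop for isolated nodes
d = A.sum(1)
A_norm = A / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
X = rng.normal(size=(6, 3))
theta = np.array([0.5, 0.3, 0.2])
spatial = sum(t * np.linalg.matrix_power(A_norm, k) @ X for k, t in enumerate(theta))
assert np.allclose(polynomial_filter(A_norm, X, theta), spatial)
```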

Table 1 Symbol definitions

3 Related works

3.1 Attention mechanisms

The attention mechanism in transformers (Vaswani et al. 2017) computes the attention scores as the dot product of query and key vectors derived from input embeddings:

$$\begin{aligned} \text {Attention}(\varvec{Q}, \varvec{K}, \varvec{V}) = \text {softmax}\left( \frac{\varvec{Q}\varvec{K}^T}{\sqrt{d}}\right) \varvec{V}, \end{aligned}$$
(1)

where \(\varvec{K}^T\) denotes the transpose of the key matrix \(\varvec{K}\) and d is the key dimension. In graph attention networks (GATs) (Velickovic et al. 2018), the attention score between a node i and its neighbor j is computed by taking the dot product between a query vector \(\varvec{a}\) and the concatenation of their transformed feature vectors and then normalizing over the neighbors via softmax:

$$\begin{aligned} \overline{A}_{ij} = \frac{\mathop {exp}\left( \sigma _{leak}\left( [\varvec{W}\varvec{h}_i ||\varvec{W}\varvec{h}_j]\cdot \varvec{a}\right) \right) }{\sum _{j' \in \mathcal {N}_i}\mathop {exp}\left( \sigma _{leak}\left( [\varvec{W}\varvec{h}_i ||\varvec{W}\varvec{h}_{j'}]\cdot \varvec{a}\right) \right) }, \end{aligned}$$
(2)

where \(\cdot\) represents the dot product and the LeakyReLU activation \(\sigma _{leak}\) is applied to introduce nonlinearity in the attention mechanism.

Both mechanisms assign different weights to inputs based on their relevance in a given context. GAT is more flexible than GCN, but in some cases it behaves like GCN (NT and Maehara 2019), and it still suffers from the oversmoothing issue. Moreover, GAT utilizes only a breadthwise attention mechanism. In this paper, we propose HA as a depthwise attention mechanism that weighs the importance of different hops and overcomes the oversmoothing issue.

3.2 Diffusion-based GNNs

Let us suppose that \(\mathcal G\) is an undirected graph with a normalized adjacency matrix (transition matrix) \(\overline{{\varvec{A}}}\) and node feature matrix \({\varvec{X}}\). Graph diffusion (Klicpera et al. 2019a, b; Page et al. 1999; Kondor and Lafferty 2002) can be described as:

$$\begin{aligned} f(\varvec{X},\overline{\varvec{A}})=\sum _{k=0}^{K}\theta _{k}\overline{\varvec{A}}^{k}{\varvec{X}}, \end{aligned}$$
(3)

with l1-normalized nonnegative weights \({\theta _{k}}\) and maximum diffusion depth K.
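The sum in Eq. 3 can be computed implicitly without ever materializing the powers \(\overline{\varvec{A}}^{k}\), which underlies the memory efficiency of GDNs discussed below. A minimal Python sketch of this iterative computation (illustrative, not a specific library API) is:

```python
import numpy as np

def graph_diffusion(A_norm, X, theta):
    """Implicit graph diffusion Σ_{k=0}^{K} θ_k Ā^k X (Eq. 3).

    Only the running k-hop signal Ā^k X is kept in memory; the powers Ā^k
    themselves are never formed. `A_norm` can be any object supporting `@`
    (a dense array or a scipy sparse matrix)."""
    out = theta[0] * X
    H = X
    for k in range(1, len(theta)):
        H = A_norm @ H               # one more message-passing (hop) step
        out = out + theta[k] * H
    return out
```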

Diffusion-based GNNs enlarge receptive fields by performing multiple message-passing steps per layer while using fewer layers. The graph diffusion convolution (GDC) (Klicpera et al. 2019b) method computes an explicit diffusion matrix during preprocessing and the diffused features during training and inference. However, GDC faces limitations in scalability, flexibility, and link prediction performance. Graph diffusion networks (GDNs) address these limitations by using implicit, memory-efficient graph diffusion. They iteratively calculate multihop features and aggregate them during training and inference, resulting in smaller memory requirements.

Spectral GNNs, which include diffusion-based GNNs and some decoupled GNNs (Defferrard et al. 2016; Klicpera et al. 2019a; Chen et al. 2020), employ graph diffusion as polynomial filters in the spectral domain. Using the preparation in Subsect. 2.1, graph diffusion can be written in the spectral domain as:

$$\begin{aligned} f(\varvec{X},\overline{\varvec{A}}) = \varvec{U}g(\varvec{\Lambda })\widetilde{\varvec{X}}, \end{aligned}$$
(4)
$$\begin{aligned} g(\varvec{\Lambda })=\sum _{k=0}^{K}\theta _{k}(\varvec{I}-\varvec{\Lambda })^{k}. \end{aligned}$$
(5)

Thus, graph diffusion can be interpreted as a polynomial filter in the spectral domain. For comparison, ignoring feature transformations, multiple GCN layers are a special case of graph diffusion in which only the K-th-order coefficient \(\theta _{K}\) is nonzero and equal to 1. Due to the approximation power of polynomials, with a sufficiently large order K and suitable coefficients, graph diffusion can approximate any filter in the spectral domain. However, many existing GDNs, such as the topology adaptive graph convolutional network (TAGCN) (Du et al. 2017) and the multihop attention graph neural network (MAGNA) (Wang et al. 2021), utilize predefined coefficients. ChebNet (Defferrard et al. 2016), ARMA (Bianchi et al. 2021) and BernNet (He et al. 2021) learn coefficients naively over complicated polynomial bases or forms. Our proposed AGDNs offer more flexible coefficients and can learn arbitrary filters even with the simplest polynomial bases, enhancing both node classification and link prediction performance.

3.3 Other deep GNNs

Residual GNNs such as the jumping-knowledge network (JKNet) (Xu et al. 2018), GCNII (Chen et al. 2020), DeepGCN (Li et al. 2019), DeeperGCN (Li et al. 2020), and RevGNNs (Li et al. 2021) preserve shallow information by using residual connections. DGCNN (Zhang et al. 2018) utilizes learnable convolution kernels to combine intermediate representations. Despite their excellent performance, these GNNs often involve large-scale architectures, leading to extended runtimes. Alternatively, our compact AGDNs with large receptive fields offer competitive performance with fewer parameters and less runtime. Residual connections can also be incorporated into AGDNs.

Decoupled GNNs such as DCNN (Defferrard et al. 2016), SGC (Wu et al. 2019), SIGN (Rossi et al. 2020), APPNP (Klicpera et al. 2019a), GPR-GNN (Chien et al. 2021), and DAGNN (Liu et al. 2020) achieve deep receptive fields with lower complexity by separating the message-passing and feature transformation steps. Despite addressing oversmoothing issues, their decoupled architecture limits model capacity and poses compatibility issues with certain GNN techniques. In contrast, AGDNs manage complexity effectively while maintaining the stacked multilayer architecture characteristic of GNNs, thereby avoiding these limitations.

4 Proposed methods

In this section, we first introduce the framework of AGDNs and propose a novel symmetric graph attention network (GAT) (Velickovic et al. 2018) transition matrix. Then, we propose two scalable mechanisms, together with positional embeddings, for combining multihop information.

4.1 Adaptive graph diffusion networks

For the following discussion, we omit the detail that the last (L-th) layer does not include a nonlinear activation. A common L-layer GNN model follows a stacked multilayer architecture: \(f({\varvec{X}}, \varvec{A})=f^{(L)}(...f^{(2)}(f^{(1)}({\varvec{X}},\varvec{A}),{\varvec{A}})..., {\varvec{A}})\).

Fig. 2: Model architecture of an AGDN model. a We preserve the stacking architecture of common GNNs in AGDNs. b In each AGDN layer, we perform feature transformation and adaptive graph diffusion. c We utilize generalized weights for adaptive graph diffusion. The operator \(\otimes\) represents matrix multiplication, the bold operator \(\varvec{\odot }\) represents elementwise multiplication, and the operator \(\oplus\) represents summation

For an AGDN model, the l-th layer, which outputs the l-th node representation matrix \(\varvec{H}^{(l)}\), can be described as follows:

$$\begin{aligned} \varvec{H}^{(l)}=f^{(l)}({\varvec{H}}^{(l-1)}, \varvec{A})=\sigma \left( \sum ^{K}_{k=0}{\varvec{\Theta }}^{(l)}_{:,k,:}\odot (\overline{\varvec{A}}^{(l),k}{\varvec{H}}^{(l-1)}{\varvec{W}}^{(l)})\right) , \end{aligned}$$
(6)

where \(\odot\) refers to elementwise multiplication (the Hadamard product). We illustrate the overall architecture of an AGDN model in Fig. 2. To enhance the flexibility and expressiveness of GDNs, we introduce variable weights for different nodes and feature channels. These weights are represented by a 3-dimensional weighting tensor \(\varvec{\Theta }^{(l)}\in \mathbb R^{N\times (K+1)\times d^{(l)}}\). Instead of directly multiplying the representation matrix by a scalar \(\theta _{k}^{(l)}\), the associated element of a weighting vector \(\varvec{\theta }^{(l)} \in {\mathbb {R}}^{K+1}\), we perform an elementwise multiplication between the representation matrix and a matrix \(\varvec{\Theta }_{:,k,:}^{(l)}\in {\mathbb {R}}^{N\times d^{(l)}}\), the associated slice of the weighting tensor \(\varvec{\Theta }^{(l)}\). A larger element \({\Theta }^{(l)}_{i,k,c}\) indicates that the k-th hop representation contributes more to the output representation of node i in the c-th channel. GDNs are a special case of AGDNs in which, for each hop k, all elements in \(\varvec{\Theta }^{(l)}_{:,k,:}\) are identical. GNNs following the MPNN framework are a special case of AGDNs in which, for every layer l, all elements in \(\varvec{\Theta }^{(l)}_{:,k,:}\) are 1 for \(k=1\) and 0 for \(k\ne 1\).
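The following PyTorch sketch illustrates one AGDN layer as written in Eq. 6, assuming a dense transition matrix and taking the weighting tensor \(\varvec{\Theta }\) as an explicit argument; in the actual model, \(\varvec{\Theta }\) is produced by HA or HC, introduced below, and ReLU stands in for the generic activation \(\sigma\). All names are illustrative.

```python
import torch
import torch.nn as nn

class AGDNLayer(nn.Module):
    """Minimal sketch of one AGDN layer (Eq. 6) with a dense transition matrix
    and a caller-supplied weighting tensor Theta of shape N × (K+1) × d."""
    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.K = K
        self.lin = nn.Linear(in_dim, out_dim, bias=False)   # W^{(l)}

    def forward(self, H, A_norm, Theta):
        # 0-th hop: feature transformation only.
        hops = [self.lin(H)]                                 # Ĥ^{(l,0)}
        for _ in range(self.K):
            hops.append(A_norm @ hops[-1])                   # Ĥ^{(l,k)} = Ā Ĥ^{(l,k-1)}
        Hk = torch.stack(hops, dim=1)                        # N × (K+1) × d
        return torch.relu((Theta * Hk).sum(dim=1))           # Σ_k Θ_{:,k,:} ⊙ Ĥ^{(l,k)}
```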

We omit the computation of the transition matrix \(\overline{\varvec{A}}^{(l)}\) in Eq. 6. The superscript (l) is necessary when we utilize the GAT transition matrix or its variant. We summarize the common transition matrices in Table 2. The transition matrix used in graph diffusion is the normalized adjacency matrix, which is also used in graph convolution. Since the symmetric GCN (Kipf and Welling 2017) transition matrix has been shown to be more effective on specific datasets, we propose a symmetric variant of the GAT (Velickovic et al. 2018) transition matrix to improve performance on these datasets, denoted as \(\overline{\varvec{A}}_{\text {gat.sym}}\) in Table 2.

Table 2 Transition matrices

4.2 Learning the weighting tensor

The critical issue is to obtain suitable weights. To simplify the discussion, we reformulate the l-th AGDN layer, previously described in Eq. 6, from both the channel and node viewpoints.

First, associated with the right-hand side of \(\odot\) in Eq. 6, we iteratively compute the multihop node representation vectors for node i at the l-th layer as follows:

$$\begin{aligned} \hat{\varvec{h}}^{(l,k)}_{i}= {\left\{ \begin{array}{ll} {{\varvec{W}}^{(l)}}^{\top }{\varvec{h}}_i^{(l-1)}, \text {if } k=0; \\ \sum _{j\in \mathcal {N}_i}\overline{A}^{(l)}_{i,j}\hat{\varvec{h}}^{(l,k-1)}_j, \text {if } 0<k\le K, \end{array}\right. } \end{aligned}$$
(7)

where the initial 0-th hop representation vector, \(\hat{\varvec{h}}_{i}^{(l,0)}\), is generated by applying the l-th layer’s linear transformation matrix, \(\varvec{W}^{(l)}\), to the previous layer’s representation \(\varvec{h}_i^{(l-1)}\) of node i.

Then, associated with the elementwise multiplication and summation in Eq. 6, we aggregate the multihop node representations \(\hat{\varvec{h}}_i^{(l,k)}\) with nodewise and channelwise weights \(\Theta _{i,k,c}^{(l)}\) from the weighting tensor \(\varvec{\Theta }^{(l)}\):

$$\begin{aligned} h_{i,c}^{(l)}=\sigma \left( \sum _{k=0}^{K}\Theta ^{(l)}_{i,k,c}\hat{h}^{(l,k)}_{i,c}\right) , \end{aligned}$$
(8)

where \(h_{i,c}^{(l)}\), associated with node i and the c-th channel, corresponds to the (i, c) entry of the node representation matrix \(\varvec{H}^{(l)}\) at the l-th layer, defined in Eq. 6.

Learning an adaptive weighting tensor directly can be challenging, as it leads to a significant increase in the number of parameters. To address this issue, we introduce hopwise convolution (HC) and hopwise attention (HA) as methods to obtain an adaptive weighting tensor with a reasonable number of parameters. By applying different weights, we obtain three variants of AGDN: AGDN-mean, AGDN-HA, and AGDN-HC. AGDN-mean utilizes uniform weights and can be viewed as a GDN.

Fig. 3: HA and HC. We omit the superscript (l). HC (left): the weighting tensor of hopwise convolution \(\varvec{\Theta }\) is directly derived from the \((K+1)\times d\) kernel matrix \(\varvec{\Theta }^{\text {HC}}\) by repeating it N times along the first dimension. HA (right): the hopwise attention is parameterized by \({\varvec{a}}^{\text {HA}}\) with a LeakyReLU activation function \(\sigma _{leak}\) and normalized along hops with the softmax function. The associated weighting tensor \(\varvec{\Theta }\) can be derived from the \(N\times (K+1)\) matrix \(\varvec{\Theta }^{\text {HA}}\) by repeating it d times along the third dimension

4.2.1 Hopwise convolution

We propose a simple mechanism called hopwise convolution (HC) that utilizes only \((K+1)\times d^{(l)}\) parameters. Channelwise filters in the spectral domain are important for producing multidimensional predictions (Wang and Zhang 2022). Thus, to minimize the number of parameters, we define a learnable weighting tensor with the constraint \({\Theta }_{i,k,c}^{(l)}={\Theta }_{1,k,c}^{(l)}\) for all i, i.e., all nodes share the same weights. This allows us to simplify the weighting tensor into a 2-dimensional matrix \(\varvec{\Theta }^{\text {HC},(l)}\in {\mathbb {R}}^{(K+1)\times d^{(l)}}\) by dropping the first subscript i. We can recover the complete weighting tensor \(\varvec{\Theta }^{(l)}\) by adding the first dimension and repeating \(\varvec{\Theta }^{\text {HC},(l)}\) N times along it, as illustrated in the left part of Fig. 3. In short, the output of AGDN-HC can be formulated as follows:

$$\begin{aligned} h^{(l)}_{i,c}=\sigma \left( \sum _{k=0}^{K}\Theta ^{\text {HC},(l)}_{k,c}\hat{h}^{(l,k)}_{i,c}\right) , \end{aligned}$$
(9)

where we change Eq. 8 by replacing \(\Theta _{i,k,c}\) with the elements \([\Theta _{k,c}^{HC,(l)}]_{0\le k\le K, 1\le c\le d^{(l)}}\) of the HC kernel \(\varvec{\Theta }^{HC,(l)}\). Consequently, in Eq. 9, the multihop representation vectors’ c-th channels \([\hat{h}_{i,c}^{(l,k)}]_{0\le k\le K}\) are aggregated with learnable channelwise weights \([\Theta _{k,c}^{HC,(l)}]_{0\le k\le K}\), which are directly optimized during training.

Moreover, similar to CNN kernels (Krizhevsky et al. 2012), HC uses different kernels for different channels. However, an HC kernel does not slide along the hops; it computes a single output that combines all hops.
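A minimal PyTorch sketch of the HC weighting in Eq. 9 is given below; the single \((K+1)\times d\) kernel is broadcast over the node dimension, which is equivalent to repeating it N times, and the initialization shown is illustrative rather than prescribed.

```python
import torch
import torch.nn as nn

class HopwiseConvolution(nn.Module):
    """Sketch of HC (Eq. 9): one learnable (K+1) × d kernel shared by all nodes."""
    def __init__(self, K: int, dim: int):
        super().__init__()
        self.kernel = nn.Parameter(torch.randn(K + 1, dim) / (K + 1))  # Θ^{HC}

    def forward(self, Hk):             # Hk: N × (K+1) × d multihop representations
        # Broadcasting repeats the kernel along the node dimension.
        return (self.kernel.unsqueeze(0) * Hk).sum(dim=1)              # N × d
```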

4.2.2 Hopwise attention

HC still learns hop weights directly, which may be sensitive to initialization and difficult to optimize. Thus, inspired by the attention mechanism in GAT (Velickovic et al. 2018), we propose hopwise attention (HA) to induce hop weights with \(2d^{(l)}\) parameters from a learnable query vector \({\varvec{a}}^{\text {HA},(l)}\in {\mathbb {R}}^{2d^{(l)}}\). We suppose that hop weights should be node-specific and normalized: \(\sum _{k=0}^{K}{\Theta }_{i,k,c}^{(l)}=1, \forall i,c\). For convenience, we ignore the differences along channels: \({\Theta }_{i,k,c}^{(l)}={\Theta }_{i,k,1}^{(l)}, \forall c\). We simplify the unified weighting tensor into a 2-dimensional weighting matrix \(\varvec{\Theta }^{\text {HA},(l)}\in {\mathbb {R}}^{N\times (K+1)}\) by dropping the last subscript c. \(\varvec{\Theta }^{(l)}\) can be recovered by adding the third dimension and repeating \(\varvec{\Theta }^{\text {HA},(l)}\) \(d^{(l)}\) times along it, as shown in the right part of Fig. 3.

The HA weighting tensor is calculated as follows:

$$\begin{aligned} {\Theta }^{\text {HA},(l)}_{i,k}=\frac{\textrm{exp}\left( \sigma _{leak}\left( \left[ \hat{{\varvec{h}}}^{(l,0)}_{i}\left|\right|\hat{{\varvec{h}}}^{(l,k)}_{i}\right] \cdot {\varvec{a}}^{\text {HA},(l)}\right) \right) }{\sum _{k'=0}^{K} {\textrm{exp}\left( \sigma _{leak}\left( \left[ \hat{{\varvec{h}}}^{(l,0)}_{i}\left|\right|\hat{{\varvec{h}}}^{(l,k')}_{i}\right] \cdot {\varvec{a}}^{\text {HA},(l)}\right) \right) }}, \end{aligned}$$
(10)

where \(\cdot\) represents the inner product, \(||\) represents the concatenation operation, k indexes the hop, and \(\varvec{a}^{\text {HA},(l)}\) represents the query vector of HA. Finally, we formulate the output of AGDN-HA as follows:

$$\begin{aligned} h^{(l)}_{i,c}=\sigma \left( \sum _{k=0}^{K}\Theta ^{\text {HA},(l)}_{i,k}\hat{h}^{(l,k)}_{i,c}\right) . \end{aligned}$$
(11)

The form of HA is similar to that of GAT (Velickovic et al. 2018). However, there are significant differences between them. First, GAT calculates the edge scores used in the message-passing step, whereas HA calculates the weights used for combining multihop information in the graph diffusion step. Second, the edge scores of GAT are normalized over the 1st-hop neighborhood of the destination node, whereas the weights of HA are normalized over the multihop smoothed representations of the target node. In short, HA is depthwise, while GAT is breadthwise.
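A minimal PyTorch sketch of HA (Eqs. 10 and 11) is shown below; it assumes the multihop representations are already stacked into an \(N\times (K+1)\times d\) tensor, and the query-vector initialization is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HopwiseAttention(nn.Module):
    """Sketch of HA (Eqs. 10-11): node-specific hop weights from a query vector
    a^{HA} ∈ R^{2d}, normalized over hops with softmax."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.empty(2 * dim))    # a^{HA,(l)}
        nn.init.normal_(self.query, std=0.1)                # illustrative init

    def forward(self, Hk):                                  # Hk: N × (K+1) × d
        h0 = Hk[:, :1, :].expand_as(Hk)                     # 0-th hop repeated per k
        scores = torch.cat([h0, Hk], dim=-1) @ self.query   # [Ĥ^{(l,0)} || Ĥ^{(l,k)}] · a
        weights = F.softmax(F.leaky_relu(scores), dim=1)    # Θ^{HA}: N × (K+1)
        return (weights.unsqueeze(-1) * Hk).sum(dim=1)      # N × d
```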

Positional embedding

To enhance the position (hop) information, we can add learnable positional embeddings (PEs) \({\varvec{p}}^{(k)}\in {\mathbb {R}}^{d^{(l)}\times 1}\), \(k\in \{0,1,\cdots ,K\}\), to the multihop representation vectors. Such PEs are identical for all nodes. PEs can encode additional hop positional information and theoretically allow AGDN with HA to achieve better universality, enabling it to learn arbitrary filters. We discuss the role of PEs in Subsect. 5.2.
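A hedged sketch of this step: the same learnable embedding \({\varvec{p}}^{(k)}\) is added to the k-th hop representation of every node before the HA weights are computed (the tensor sizes below are hypothetical).

```python
import torch
import torch.nn as nn

N, K, dim = 5, 3, 8                           # hypothetical sizes
Hk = torch.randn(N, K + 1, dim)               # multihop representations Ĥ^{(l,k)}
pe = nn.Parameter(torch.zeros(K + 1, dim))    # learnable p^{(0)}, ..., p^{(K)}
Hk = Hk + pe.unsqueeze(0)                     # the same embeddings are added for every node
```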

5 Model analysis

In this section, we begin by discussing the expressiveness of AGDN-HC and AGDN-HA. We provide simple spectral properties of AGDN-HC, and then, we present the universality theorem of AGDN-HA from a spatial perspective. We explain why stacking multilayer architecture is necessary for better capturing diverse information. Finally, we provide the time and space complexity of AGDNs.

5.1 Spectral properties of AGDN-HC

To simplify our analysis, we first consider the constrained condition where all nodes are assigned the same hop weights. This allows us to establish a lower bound on the expressiveness of AGDNs in general. First, we only consider 1-dimensional output. An AGDN layer applied on a graph signal \({\varvec{X}}\in {\mathbb {R}}^{N\times d}\) is described as follows:

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K}\Theta ^{\text {HC}}_{k} {\overline{{\varvec{A}}}}^{k}{\varvec{X}}&= {\varvec{U}}\left( \sum _{k=0}^{K}\Theta ^{\text {HC}}_{k}({\varvec{I}} - {\varvec{\Lambda }})^{k}\right) {\varvec{U}}^{\top }{\varvec{X}} \\&={\varvec{U}} g({\varvec{\Lambda }})({{\varvec{U}}}^{\top }{\varvec{X}}), \end{aligned} \end{aligned}$$
(12)

where \(g:[0,2]\rightarrow {\mathbb {R}}\) denotes a spectral filter and \(g({\varvec{\Lambda }})\) applies \(g(\lambda )=\sum _{k=0}^{K}\Theta ^{\text {HC}}_{k}(1-\lambda )^k\) elementwise to the diagonal entries of \(\varvec{\Lambda }\). The l-th AGDN layer acts as a polynomial filter \(g^{(l)}\) in the spectral domain, with coefficients given by the hop weights. Ignoring the differences in transition matrices, nonlinear activations, and feature transformations, the overall AGDN model is a nested polynomial filter \(g(\lambda )=g^{(L)}(\cdots g^{(2)}(g^{(1)}(\lambda ))\cdots )\). For the multidimensional output of the channelwise AGDN-HC, we can add the subscript c to \(\Theta ^{\text {HC}}_{k}\) and g in Eq. 12.

Theorem 1

(Weierstrass Approximation Theorem). Suppose h is a continuous real-valued function defined on the real interval [a, b]. For every \(\epsilon >0\), there exists a polynomial p such that for all x in [a, b], we have \(|h(x)-p(x)|< \epsilon\).

Lemma 5.1

For a connected graph \(\mathcal G\) with normalized Laplacian matrix \(\overline{{\varvec{L}}}\) whose eigenvalues are \(0=\lambda _1 \le \lambda _2\le ... \le \lambda _N\le 2\), ignoring feature transformation and nonlinear activation, an AGDN-HC layer f can be viewed as a polynomial filter with learnable coefficients \(g(\lambda )=\sum _{k=0}^{K}\Theta ^{HC}_{k}(1-\lambda )^k\) in the spectral domain. Suppose p is a polynomial defined on [0, 2]. For every \(\epsilon >0\), there exists an AGDN-HC layer f acting as a polynomial filter g such that for all \(\lambda\) in [0, 2], we have \(|g(\lambda )-p(\lambda )|<\epsilon\).

Based on Theorem 1 and Lemma 5.1, we can easily obtain the following theorem:

Theorem 2

For a connected graph \(\mathcal G\) with normalized Laplacian matrix \(\overline{{\varvec{L}}}\) whose eigenvalues are \(0=\lambda _1 \le \lambda _2\le ... \le \lambda _N\le 2\), ignoring feature transformation and nonlinear activation, an AGDN-HC layer f can approximate any continuous real-valued function h defined on the real interval [0, 2] with a sufficiently large K.

However, as discussed in (Wang and Zhang 2022), to generate arbitrary outputs, we must add a feature transformation and consider the characteristics of the eigenvalues and of the given input graph signal. The sufficient conditions are expressed in the following universality theorem:

Theorem 3

(Wang and Zhang 2022). An AGDN-HC layer with a feature transformation, or any other spectral GNN that can approximate arbitrary spectral filters, can produce any 1-dimensional prediction if \(\overline{{\varvec{L}}}\) has no repeated eigenvalues and the node feature \({\varvec{X}}\in {\mathbb {R}}^{N\times d}\) contains all frequency components (\(\alpha _i\ne 0,\ \forall i\in \{1,2,\cdots ,N\}\)).

The two conditions can be easily satisfied or approximately satisfied in real-world datasets. The channelwise filters can further ensure the production of arbitrary multidimensional predictions.

The above theorems are also applicable to several spectral GNNs (Defferrard et al. 2016; Chien et al. 2021; Bianchi et al. 2021; He et al. 2021; Wang and Zhang 2022), which can be regarded as polynomial filters with learnable coefficients. However, in many cases, it can be challenging to directly learn coefficients and to achieve optimal performance. Some spectral GNNs (Defferrard et al. 2016; He et al. 2021; Wang and Zhang 2022) utilize special polynomial bases to facilitate optimization. In this paper, we propose stacking multiple AGDN-HC layers and using channelwise filters to increase the ability to learn arbitrary filters. The channelwise filters are crucial when separating and preserving different frequency components of the same input graph signal. We demonstrate that even using the simplest form, the AGDN-HC can still significantly outperform most other spectral GNNs in learning all example filters.

5.2 AGDN-HA

AGDN-HA uses nonnegative and normalized hop weights, while AGDN-HC uses arbitrary hop weights. HA provides effective control over the magnitude of combined diffusion results, making it less sensitive to initialization. Additionally, HA is based on input representations and has better generalizability. The nodewise weights are critical for the AGDN-HA to learn various filters. First, we still consider the condition that all nodes share the same HA weights. Then, we can obtain exactly the same form as Eq. 12 but with different constraints on hop weights: \(\Theta ^{\text {HA}}_{k}\ge 0\) and \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{k}=1\). With these constraints, AGDN-HA can learn only low-pass filters.

Theorem 4

(Chien et al. 2021) For a connected graph \(\mathcal G\) with normalized Laplacian matrix \(\overline{{\varvec{L}}}\) whose eigenvalues are \(0=\lambda _1 \le \lambda _2\le ... \le \lambda _N\le 2\), ignoring feature transformation and nonlinear activation, a constrained AGDN-HA layer f can be viewed as a polynomial filter \(g(\lambda )=\sum _{k=0}^{K}\Theta ^{\text {HA}}_{k}(1-\lambda )^k\) in the spectral domain with nonnegative and l1-normalized learnable coefficients: \(\Theta ^{\text {HA}}_{k}\ge 0,\ \forall k\in \{0,1,\cdots ,K\}\), and \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{k}=1\). If \(\exists k^{\prime }>0\) such that \(\Theta ^{\text {HA}}_{k^{\prime }}>0\), then g can only learn a low-pass filter and \(|g(\lambda _1)|> |g(\lambda _i)|,\ \forall i\ge 2\).

The proof can be found in (Chien et al. 2021).

In AGDN-HA, the use of nodewise weights is crucial in overcoming the limitation discussed in Theorem 4. However, since the nodewise weights can easily change eigenvectors, the spectral analysis for AGDN-HA is very difficult. Thus, we directly obtain AGDN-HA’s universality for multidimensional cases from a spatial perspective by using a strong condition, which includes the ability to learn arbitrary filters. To clarify, by using the node viewpoint form in Eqs. 8 and 11, we define the multihop node feature vectors for node i as \(\hat{{\varvec{x}}}^{(k)}_i=\sum _{j\in \mathcal {N}_i}\overline{A}_{i,j}\hat{{\varvec{x}}}^{(k-1)}_j\) in \({\mathbb {R}}^{d\times 1}\) (\(\hat{{\varvec{x}}}^{(0)}_i={\varvec{x}}_i\)). We find that for each node i, all possible values of its associated output vector \(\overline{{\varvec{x}}}_i=\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}\hat{{\varvec{x}}}^{(k)}_i\) constitute the convex hull \(Q_{i}\) of its multihop feature vectors \(\{\hat{{\varvec{x}}}^{(0)}_{i}, \hat{{\varvec{x}}}^{(1)}_{i},\cdots ,\hat{{\varvec{x}}}^{(K)}_{i}\}\), since \(\forall i\), \(\Theta ^{\text {HA}}_{i,k}\ge 0\) and \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}=1\). As HA weights are node specific, all possible values of the set \(\{\overline{{\varvec{x}}}_1,\overline{{\varvec{x}}}_2,\cdots ,\overline{{\varvec{x}}}_N\}\) constitute the set \(Q_1\times Q_2\cdots \times Q_N\).

Proposition 5

An AGDN-HA layer with a learnable scaling factor w can produce any d-dimensional prediction \({\varvec{Z}}\in {\mathbb {R}}^{N\times d}\) if for every node i, there exists a d-dimensional ball centered at the origin belonging to the convex hull of its multihop feature vectors given a d-dimensional node feature input \({\varvec{X}}\in {\mathbb {R}}^{N\times d}\).

Proof

With learnable w, for each node i, its output vector is defined as \({{\varvec{x}}}^{\prime }_i=w\overline{{\varvec{x}}}_i=w\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}\hat{{\varvec{x}}}^{(k)}_i\). The target vector for node i is denoted by \({\varvec{z}}_i\in {\mathbb {R}}^{d\times 1}\) and \({\varvec{Z}}=[{\varvec{z}}_1, {\varvec{z}}_2,\cdots ,{\varvec{z}}_N]^{\top }\). The condition that there exists a d-dimensional ball centered at the origin belonging to \(Q_i\) is equivalent to \(\forall i, \exists \epsilon _i>0,\{\overline{{\varvec{x}}}_{i}\mid ||\overline{{\varvec{x}}}_{i}||_2<\epsilon _i\}\subset Q_{i}\). The proposition reduces to proving that \(\forall {\varvec{Z}}\in {\mathbb {R}}^{N\times d}, \exists \{\overline{{\varvec{x}}}_1,\overline{{\varvec{x}}}_2,\cdots ,\overline{{\varvec{x}}}_N\}\in Q_1\times Q_2\cdots \times Q_N\) such that \(\forall i, w\overline{{\varvec{x}}}_i={{\varvec{z}}}_{i}\). For any target prediction \({\varvec{Z}}\), by setting \(w>\sup _{i}{\frac{||{\varvec{z}}_{i}||_2}{\epsilon _i}}\), we can simply construct \(\overline{{\varvec{x}}}_i=\frac{1}{w}{{\varvec{z}}}_{i}\) belonging to \(Q_i\) because \(||\overline{{\varvec{x}}}_i||_2=\frac{1}{w}||{\varvec{z}}_i||_2<\epsilon _i\). In short, by setting a sufficiently large scaling factor w, we can always find a feasible set \(\{\overline{{\varvec{x}}}_1,\overline{{\varvec{x}}}_2,\cdots ,\overline{{\varvec{x}}}_N\}\) for producing any target prediction. Therefore, the proof is complete. \(\square\)

In practical datasets, it may be difficult to satisfy the strong condition of this theorem. Since we need to construct a d-dimensional convex hull, we should set \(K\ge d\) and ensure that for every node i, its multihop matrix \([\hat{{\varvec{x}}}^{(0)}_i,\hat{{\varvec{x}}}^{(1)}_i,\cdots ,\hat{{\varvec{x}}}^{(K)}_i]\) has a column rank equal to d. In many cases, when d is large, the oversmoothing problem causes the high-hop vectors to become nearly identical; thus, the rank is typically smaller than d. We incorporate learnable multihop embeddings (PEs) \({\varvec{p}}^{(k)}\in {\mathbb {R}}^{d\times 1}\) to solve this problem.

Corollary 6

By incorporating learnable PEs \({\varvec{p}}^{(k)}\) and a learnable scaling factor w, the AGDN-HA layer can produce any d-dimensional prediction \({\varvec{Z}}\) given any d-dimensional graph signal \({\varvec{X}}\).

Proof

We denote the output by \({{\varvec{x}}}^{\prime }_{i} = w\overline{{\varvec{x}}}_i+\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}{\varvec{p}}^{(k)}\). To demonstrate this corollary, we consider a special case where w is zero, i.e., only the linear combination of PEs is considered. For each node i, the set of all possible outputs \({{\varvec{x}}}^{\prime }_{i}\) is exactly the convex hull P of \(\{{\varvec{p}}^{(0)},{\varvec{p}}^{(1)},\cdots ,{\varvec{p}}^{(K)}\}\). We can construct \({\varvec{p}}^{(k)}\) satisfying the condition of Proposition 5. For example, we can set \(K=d\) and let \(\{{\varvec{p}}^{(k)}\}\) consist of the d standard basis vectors and the single vector \([-1, -1,\cdots , -1]^{\top }\). Then P is d-dimensional and contains the origin (reached by setting \(\Theta ^{\text {HA}}_{i,k}=\frac{1}{d+1}\)); thus, there exists a d-dimensional ball centered at the origin that is a subset of P. We denote the radius of this ball by \(\epsilon\). Then, we enlarge this radius to \(w^{\prime }\epsilon\) by multiplying every \({\varvec{p}}^{(k)}\) by an additional scaling factor \(w^{\prime }>\frac{\sup _{i}||{\varvec{z}}_i||_2}{\epsilon }\). For all i, we always have \(\varvec{z}_i\in P\) since \(\sup _i||{\varvec{z}}_i||_2<w^{\prime }\epsilon\), i.e., for any target prediction \({\varvec{Z}}\) and every i, \(\exists \varvec{\Theta }^{\text {HA}}\in {\mathbb {R}}^{N\times (K+1)}\) such that \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}{\varvec{p}}^{(k)}={\varvec{z}}_{i}\). Hence, the proof is completed. \(\square\)

It is worth mentioning that the value of K in the AGDN-HA layer may be significantly lower than d, since the dimension of the smallest convex subspace that encompasses all target vectors can be small and complete universality may not always be needed. Additionally, in actual cases, the first term \(w\overline{{\varvec{x}}}_i\) involving the input features is not zero, and the second term \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}{\varvec{p}}^{(k)}\) offers complementary capability for approximating information that cannot be approximated by the first term alone. We also conduct an empirical analysis of the HA weights in the Appendix.

5.3 Stacking multilayer architecture

The stacking multilayer architecture with intermediate feature transformations and nonlinear activations is a critical aspect of AGDNs, as it improves model capacity and maintains compatibility with other important techniques. From a spectral perspective, the multilayer architecture is essential for leveraging variable filters. We assume that variable filters such as low-pass, high-pass, or other filters within the same model are crucial for capturing diverse information.

First, feature transformations between graph diffusion operators can enhance the quality of graph signals. By Theorem 3, arbitrary outputs can be produced provided that repeated eigenvalues and missing frequency components in the node features are avoided. Specifically, given a 1-dimensional graph signal \({\varvec{X}}=\sum _{i=1}^N \alpha _i {{\varvec{u}}}_i\) and a graph filter \(g(\lambda )\), the output becomes \(\sum _{i=1}^N g(\lambda _i) \alpha _i {{\varvec{u}}}_i\). Consequently, if \(\alpha _i=0\), then the output signal always misses the i-th component. Feature transformation, also known as channel mixing, can alleviate this issue by mixing graph signals from different channels.

However, if all graph signals have similar frequency components, linearly mixing them cannot introduce other frequency components. In such cases, nonlinear activations can be beneficial. For a graph signal \({\varvec{X}}\) and its spectral signal \(\widetilde{{\varvec{X}}}\), we apply a graph filter g with a nonlinear activation \(\sigma\): \({\varvec{U}} g(\varvec{\Lambda }) {{\varvec{U}}}^{\top }\sigma (\varvec{X})={\varvec{U}} g(\varvec{\Lambda }){\varvec{U}}^{\top }\sigma ({\varvec{U}} \widetilde{{\varvec{X}}})\). As explained in (Wang and Zhang 2022), we can define the activation function in the spectral domain as \(\sigma ^{\prime }(\widetilde{{\varvec{X}}})={\varvec{U}}^{\top } \sigma ({\varvec{U}} \widetilde{{\varvec{X}}})\). For each column of \(\widetilde{{\varvec{X}}}\), the frequency components from different rows are first linearly mixed with the eigenvector basis \({\varvec{U}}\), then elementwise transformed by \(\sigma\), and finally transformed back with \({\varvec{U}}^{\top }\). The nonlinear operation \(\sigma\) ensures that the linearly mixed components become nonlinearly mixed and cannot be recovered into the original separated components by left-multiplication with the transposed orthonormal eigenvector matrix \({{\varvec{U}}}^{\top }\), thereby allowing \(\sigma ^{\prime }\) to change the elements of \(\widetilde{{\varvec{X}}}\).
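A small NumPy illustration of this effect, under an assumed ring-graph setup: a signal with a single frequency component is passed through a spatial-domain ReLU, and its spectral representation afterwards contains several nonzero components.

```python
import numpy as np

N = 8
A = np.zeros((N, N))
for i in range(N):                          # ring graph: node i connected to i+1 (mod N)
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
d = A.sum(1)
L_norm = np.eye(N) - A / d[:, None]         # all degrees equal 2, so this is symmetric
lam, U = np.linalg.eigh(L_norm)

x = U[:, 3]                                 # signal with a single frequency component
print(np.round(U.T @ x, 3))                 # spectral coefficients: one nonzero entry
print(np.round(U.T @ np.maximum(x, 0), 3))  # after spatial ReLU: several nonzero entries
```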

In actual datasets, raw node features rarely have missing frequency components. However, after a GNN layer or graph filter, the output graph signal can easily become dominated by certain frequency components. For instance, a low-pass filter emphasizes the low-frequency components, so applying a high-pass filter to its output does not yield enough valuable information. With feature transformations and nonlinear activations in the stacking multilayer architecture, we can mitigate this problem.

5.4 Complexity analysis

The time complexity of an L-layer AGDN model is \(O(LKEd+LKNd+LNd^2)\), but it is simplified to \(O(LKEd+LNd^2)\) under the assumptions of \(N\ll E\) and \(K\ll d\). The adaptive diffusion mechanism in AGDN layers increases their time complexity compared to common GNN layers due to K-hop aggregations, elementwise multiplications, and optional HA computation. However, this mechanism significantly enhances the expressiveness of each AGDN layer. As a result, a compact AGDN model with fewer layers can match or surpass the performance of typical GNN models, where fewer layers counteract the increased complexity per layer. Furthermore, AGDNs are orthogonal to memory-saving techniques (Li et al. 2018; Zeng et al. 2020; Li et al. 2021) for GNNs, making them a promising approach for large-scale graph learning tasks.

Table 3 Complexity comparison

6 Experiments

In this section, we present the evaluation of AGDNs on various datasets, covering tasks of learning arbitrary filters, node classification, and link prediction. For the OGB datasets, we utilize the official data splits and metrics, which were carefully selected by the benchmark authors. For the heterophily-prone datasets, we randomly split the data into training, validation, and test sets according to fixed ratios in each run. We also conducted additional experiments with complementary metrics. Table 4 provides a comprehensive summary of these datasets. We conducted all experiments on a single Nvidia Tesla V100 with 16 GB of GPU memory. To address the memory constraints, we employed graph-based sampling techniques such as random partition (Li et al. 2020; Shi et al. 2021) for ogbn-proteins and ogbn-products and GraphSAINT (Zeng et al. 2020) for ogbl-citation2. For all datasets except ogbn-products and ogbl-citation2 (which were evaluated on CPU), we performed both training and inference of AGDNs on the same GPU card. For the last two tasks, we conducted 10 runs of each proposed model with random seeds 0–9 and report the means. In the comparison tables, we highlight the best results in bold. Unavailable results are indicated by –, as some baselines have not been implemented or reported for certain datasets. We list the critical hyperparameters in Table 5.

Table 4 Dataset statistics
Table 5 Hyperparameters of AGDNs. K = diffusion depth for each layer, L = number of layers, d = hidden dimension, heads = number of attention heads, \(\overline{\varvec{A}}\) = transition matrix, \(\varvec{\Theta }\) = weighting tensor style (HA or HC)

6.1 Task 1: learning filters

We conducted an experiment on 50 real images from the image processing toolbox in MATLAB with a resolution of \(100\times 100\), following the approach of (He et al. 2021). We treated each image as a 2D regular 4-neighborhood grid graph with an associated \(10000\times 10000\) adjacency matrix and a \(10000\times 1\) node feature vector. The task was to learn 5 different filters (low-pass, high-pass, bandpass, band-rejection, and comb) by applying models with graph signals as input. The formulas for the filters are shown in Table 6. The loss function used was the sum of squared errors between the output of the models and the filtered signal from the target filter, making this a simple regression task. We utilized seven baseline models: GCN (Kipf and Welling 2017), GAT (Velickovic et al. 2018), ARMA (Bianchi et al. 2021), ChebyNet (Defferrard et al. 2016), GPR-GNN (Chien et al. 2021), BernNet (He et al. 2021), and JacobiConv (Wang and Zhang 2022). The reported results of the baselines were taken from (He et al. 2021), except for JacobiConv, since its paper does not provide \(R^2\) scores; we re-evaluated JacobiConv and report its \(R^2\) scores. To implement the AGDN, we designed a 2-layer architecture with 8 hidden dimensions (4 heads) in the first layer and 8 hidden dimensions (8 heads) in the second layer, following the setting of the GAT in (He et al. 2021). We used a linear layer at the end of the model to output predicted values and an ELU as the intermediate nonlinear activation function. We scaled up the scores and removed LeakyReLU before the softmax to enhance the imbalance in the HA weights, which empirically improved performance. To ensure a fair comparison, we used the default hyperparameters from (He et al. 2021), except for the early stopping parameter: learning rate 0.01, 2000 epochs, and early stopping 300. We utilized PEs in the AGDN. Unlike (Wang and Zhang 2022), we did not use regularization techniques or tune the hyperparameters with automated tools.
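For reference, the following sketch shows how such a regression target can be constructed: build the 4-neighborhood grid graph, eigendecompose its normalized Laplacian, and apply a chosen filter \(g(\lambda)\) to a graph signal. The grid size and the example filter below are illustrative, not the exact definitions in Table 6, and the dense eigendecomposition is used only because the demo graph is small.

```python
import numpy as np
import scipy.sparse as sp

def grid_graph(h: int, w: int) -> sp.csr_matrix:
    """4-neighborhood grid graph used for the image-filtering task (sketch)."""
    idx = np.arange(h * w).reshape(h, w)
    rows = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    cols = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    A = sp.coo_matrix((np.ones(rows.size), (rows, cols)), shape=(h * w, h * w))
    return (A + A.T).tocsr()

def filtered_target(A: sp.csr_matrix, x: np.ndarray, g) -> np.ndarray:
    """Ground-truth signal U g(Λ) Uᵀ x for a chosen spectral filter g(λ)."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = np.eye(A.shape[0]) - (A.toarray() / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    lam, U = np.linalg.eigh(L)
    return U @ (g(lam) * (U.T @ x))

# Hypothetical usage with a 10×10 grid and an illustrative band-pass-like filter.
A = grid_graph(10, 10)
x = np.random.default_rng(0).random(100)
y = filtered_target(A, x, lambda lam: np.exp(-10.0 * (lam - 1.0) ** 2))
```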

Table 6 shows the average sum of squared errors and \(R^2\) scores. As expected from (Balcilar et al. 2021), GCN and GAT are capable of learning only low-pass filters. GPR-GNN, ARMA, and ChebyNet can learn several filters, but they are significantly outperformed by AGDNs, BernNet, and JacobiConv on all tasks. AGDN-HC outperforms GPR-GNN, which has a similar structure, on all tasks, demonstrating the effectiveness of the stacking multilayer architecture. BernNet and JacobiConv, which use more complicated bases and carefully tuned hyperparameters, also outperform AGDNs. The margin between AGDNs and BernNet or JacobiConv is small and acceptable considering that we did not design AGDNs with special polynomial bases. Incorporating Bernstein bases into AGDN, denoted as AGDN-HA* and AGDN-HC* in Table 6, leads to improved performance, outperforming BernNet on all filters except the high-pass filter and surpassing JacobiConv on the band-pass and comb filters. AGDN-HC outperforms AGDN-HA on most filters. This is reasonable since these tasks require the models to learn the same filter applied to all nodes, making node-specific HA weights difficult to optimize. Nonetheless, validating Corollary 6, AGDN-HA can still learn all filters and outperforms the other baselines except BernNet and JacobiConv. Furthermore, by tuning a few hyperparameters (\(L=3\), \(K=15\), \(d=16\), \(heads=8\)), the resulting AGDN-HC* (3-layer) significantly improves performance and outperforms all baselines on all filters, except for JacobiConv on the low-pass filter, where the gap is minimal. These results demonstrate that both AGDN-HA and AGDN-HC have the powerful ability to learn arbitrary filters in the spectral domain and can be further improved with certain polynomial bases.

Table 6 Task 1: Learning filters on 50 images

6.2 Task 2: semisupervised node classification

For semisupervised node classification, we evaluated AGDNs on three homophily-prone datasets (ogbn-arxiv, ogbn-proteins and ogbn-products) (Hu et al. 2020) and three heterophily-prone datasets (Chameleon, Squirrel and Actor) (Rozemberczki et al. 2021; Pei et al. 2020). We select the best model based on validation scores.

For the first three datasets, we utilize GCN (Kipf and Welling 2017), GraphSAGE (Hamilton et al. 2017), DeeperGCN (Li et al. 2020), GCNII (Chen et al. 2020), MAGNA (Wang et al. 2021), UniMP (Shi et al. 2021), LEGNN (Yu et al. 2022) and RevGNN (Li et al. 2021) as baselines. LEGNN is a recently proposed label-enhanced GNN. RevGNN has been shown to be a SOTA GNN and remains the main backbone of recent SOTA submissions (combined with other tricks) on the OGB leaderboard. We also compare against BernNet (He et al. 2021), utilizing its official implementation with the same hyperparameters used for the OGB baselines (3 layers with 256 hidden dimensions, learning rate 0.01, weight decay 0, and 1000 epochs). We adopt the official dataset split from (Hu et al. 2020). We incorporate several popular tricks for AGDN on ogbn-arxiv, including the bag of tricks for GNNs (BoT) (Wang 2021), self-knowledge distillation (Self-KD) (Zhang et al. 2019), and graph information-aided node feature extraction with XR-Transformers (GIANT-XRT) (Chien et al. 2022).

For the last three datasets, we utilize MLP, GCN (Kipf and Welling 2017), GAT (Velickovic et al. 2018), APPNP (Klicpera et al. 2019a), ChebyNet (Defferrard et al. 2016), GPR-GNN (Chien et al. 2021), JKNet (Xu et al. 2018), JacobiConv (Wang and Zhang 2022) and GOAL (Zheng et al. 2023) as baselines. We adopt the random dataset split, consistent across all classes, established in the GOAL paper (Zheng et al. 2023) and use the baseline results reported therein, except for JacobiConv (Wang and Zhang 2022), which was not evaluated in that study and originally uses a class-specific random split. We re-evaluated JacobiConv using its official implementation and the dataset split specified in GOAL to ensure consistency and comparability. In detail, we randomly split the node set of each dataset according to the ratio \(60\%\)/\(20\%\)/\(20\%\) into training, validation, and test sets, respectively. We randomly generate the split 10 times and run each experiment on each of the 10 splits.

Table 7 Task 2: Semisupervised node classification on the ogbn-arxiv
Table 8 Task 2: Semisupervised node classification on ogbn-proteins
Table 9 Task 2: Semisupervised node classification on ogbn-products

Comparison with classical and decoupled GNNs

AGDN, which offers more expressive graph diffusion than graph convolution, outperforms classical GNNs, such as GCN and GraphSAGE, in semisupervised node classification tasks in Tables 7, 8, and 9. Moreover, the stacking multilayer architecture of AGDN endows it with better expressiveness than that of the decoupled GNN, SIGN, on ogbn-arxiv and ogbn-products. Additionally, AGDN outperforms BernNet on ogbn-arxiv, where learning arbitrary filters may not be as crucial since node features are important and low-pass filters are adequate to capture critical information. The stacking multilayer architecture also facilitates the application of critical tricks, such as BoT, self-KD, and GIANT-XRT, on AGDN in ogbn-arxiv. Remarkably, the AGDN achieves better performance (73.51% test accuracy) than that of other classical and decoupled GNNs, even without these tricks.

Table 10 Task 2: Semisupervised node classification on heterophily-prone datasets

Comparison with spectral GNNs

As validated in Sect. 6.1, AGDNs demonstrate the ability to learn arbitrary filters, enabling strong performance on heterophily-prone datasets. Table 10 summarizes the average accuracy and standard deviation, showing that AGDN consistently outperforms all other spectral GNNs across the three datasets. Notably, AGDN surpasses the baselines by at least \(0.44\%\), highlighting its capability to learn non-low-pass filters effectively. These improvements stem from the adaptive graph diffusion mechanism and the stacking multilayer architecture, which captures diverse structural information and aligns the learned weighting tensor with the target filter characteristics, yielding superior performance.

Comparison with residual GNNs

AGDN demonstrates superior performance compared to that of complex residual GNNs, including GCNII, LEGNN, and SOTA RevGNNs, as shown in Tables 7, 8, and 9, thanks to its adaptive graph diffusion. Furthermore, in Table 11, AGDN consistently outperforms RevGNNs (RevGAT) of similar complexity when critical tricks (BoT, self-KD, and GIANT-XRT) are progressively applied. The AGDN achieves superior performance on all datasets with moderate complexity, leveraging multihop information more effectively with a moderate receptive field and significantly fewer layers than deep residual GNNs. For instance, on ogbn-proteins, a compact AGDN (6 layers) outperforms RevGNN-wide (448 layers) with only 13% of the parameter count, and on ogbn-products, a compact AGDN (4 layers) outperforms RevGNN-112 (112 layers) with just 52% of the parameter count.

Table 11 Comparison with SOTA RevGNN on ogbn-arxiv

6.3 Task 3: link prediction

For link prediction, we evaluate models on ogbl-ppa, ogbl-ddi, and ogbl-citation2. We utilize DeepWalk (Perozzi et al. 2014), Matrix Factorization (Menon and Elkan 2011) (MF), Common Neighbor (Liben-Nowell and Kleinberg 2007) (CN), Adamic Adar (Adamic and Adar 2003) (AA), Resource Allocation (Zhou et al. 2009) (RA), GCN (Kipf and Welling 2017), GraphSAGE (Hamilton et al. 2017), and PLNLP (Wang et al. 2021) as baselines. GNN methods are implemented with the encoder-decoder framework (Kipf and Welling 2016; Hu et al. 2020), where a GNN encoder outputs node representations, and a simple decoder or predictor makes predictions using pairs of node representations as input. We select the best model based on validation scores.
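As a concrete (hypothetical) sketch of this encoder-decoder setup, the predictor below scores node pairs from their representations using a Hadamard product followed by an MLP; this is one common choice of decoder, not necessarily the exact predictor used in our experiments.

```python
import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    """Simple decoder for encoder-decoder link prediction: scores an edge (i, j)
    from the pair of node representations (h_i, h_j)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h, edges):             # h: N × d, edges: 2 × E index tensor
        return self.mlp(h[edges[0]] * h[edges[1]]).squeeze(-1)  # one score per edge
```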

Table 12 Task 3: Link prediction on ogbl-ppa
Table 13 Task 3: Link prediction on ogbl-ddi
Table 14 Task 3: Link prediction on ogbl-citation2

As shown in Tables 12, 13, and 14, deterministic heuristic methods outperform encoder-decoder GNNs on ogbl-ppa but perform significantly worse on ogbl-ddi and ogbl-citation2. With adaptive graph diffusion, AGDN surpasses some heuristic methods (such as Common Neighbor and Adamic Adar) on ogbl-ppa.
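For reference, these heuristics score a candidate edge purely from the overlap of the endpoints' neighborhoods; the short sketch below computes Common Neighbor, Adamic Adar, and Resource Allocation scores on a toy graph (the graph and the node pair are illustrative only).

```python
import math
import networkx as nx

def heuristic_scores(G, u, v):
    """Common Neighbor (CN), Adamic Adar (AA), and Resource Allocation (RA)
    scores for a candidate edge (u, v)."""
    common = set(G[u]) & set(G[v])
    cn = len(common)
    aa = sum(1.0 / math.log(G.degree(w)) for w in common if G.degree(w) > 1)
    ra = sum(1.0 / G.degree(w) for w in common)
    return cn, aa, ra

# toy usage on a small built-in graph
G = nx.karate_club_graph()
print(heuristic_scores(G, 0, 33))
```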

AGDN significantly outperforms GCN and GraphSAGE on ogbl-ppa, ogbl-ddi, and ogbl-citation2 owing to its larger receptive field. We trained AGDN in a full-batch fashion, as with GCN and GraphSAGE, on ogbl-ppa and ogbl-ddi. On ogbl-citation2, however, we trained AGDN in a mini-batch fashion with GraphSAINT, which may negatively affect the final results; nonetheless, AGDN still outperforms full-batch GCN and GraphSAGE on this dataset. Moreover, AGDN outperforms PLNLP on ogbl-ddi and ogbl-citation2 and achieves comparable results on ogbl-ppa. Additionally, with learnable node embeddings, AGDN achieves 41.23% test and 43.32% validation Hits@100 on ogbl-ppa. These comparisons show that enlarging the receptive field of GNNs can improve link prediction performance and that the adaptive graph diffusion in AGDN can exploit valuable information in the deep neighborhood.
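A hedged sketch of the GraphSAINT-style mini-batch training loop is given below, using the PyTorch Geometric random-walk sampler; the toy graph, the sampler arguments, and the loop body are illustrative, and the exact loader signature may differ across PyG versions.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import GraphSAINTRandomWalkSampler

# tiny toy graph: a 4-node path with random features
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
data = Data(x=torch.randn(4, 8), edge_index=edge_index)

loader = GraphSAINTRandomWalkSampler(
    data, batch_size=2, walk_length=2,   # 2 root nodes, 2-hop random walks
    num_steps=3)                         # subgraphs sampled per epoch

for subgraph in loader:
    # each sampled subgraph is an ordinary Data object, so a model such as
    # AGDN can be trained on it exactly as in the full-batch setting
    print(subgraph.num_nodes, subgraph.num_edges)
```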

6.4 Runtime

Table 15 reports the training and inference runtimes of AGDN, GAT, and RevGNNs on the ogbn-arxiv, ogbn-proteins, and ogbn-products datasets. To account for the extra runtime introduced by generalized graph diffusion, we evaluate GATs with the same configuration as AGDNs, including the number of layers. On ogbn-proteins, the inference of AGDN is performed on the same GPU, whereas the inference of RevGNNs was performed on a different GPU, an Nvidia RTX A6000 (48 GB), and its runtime was not reported (Li et al. 2021). On ogbn-products, inference for all models is executed on the CPU due to memory constraints. As expected, AGDNs incur only reasonable additional runtime for both training and inference. Compared with GATs with the same receptive fields, AGDNs require less time for training and similar time for inference. We also compare AGDNs with the SOTA RevGNNs: RevGAT with only two layers requires less training time than AGDN on ogbn-arxiv, but with more layers, RevGNNs require more time than AGDNs on ogbn-proteins and ogbn-products. This suggests that AGDNs achieve a better balance between accuracy and efficiency.

Table 15 Runtime comparison on Tesla V100 cards

6.5 Parameter analysis

A parameter sensitivity analysis was performed on AGDN by using the ogbn-arxiv dataset. AGDN uses hyperparameters similar to those of other GNNs or neural networks, such as the learning rate, weight decay, and dropout. In this analysis, we focus primarily on the receptive field, a central aspect of this work. For AGDN, the receptive field (LK) can be expanded by increasing either the layer count (L) or the diffusion depth (K).
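As a toy illustration of the quantity being varied, the snippet below enumerates a few (L, K) combinations and the receptive field each one yields; a classical MPNN would need LK stacked layers to reach the same neighborhood.

```python
# Receptive field of an L-layer AGDN with diffusion depth K is L * K hops.
for L in (1, 2, 3):
    for K in (1, 2, 3):
        print(f"L={L}, K={K} -> receptive field = {L * K} hops "
              f"(equivalent MPNN depth = {L * K})")
```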

Table 16 presents the outcomes for various combinations of L and K. The optimal combination is \(L=3,K=2\). Some combinations exceeded the maximum GPU memory (16 GB) and are therefore omitted. The results reveal a general trend of improved performance when L, K, or both are increased. Expanding both L and K enlarges the receptive field more effectively than in typical GNNs, without introducing excessive complexity or triggering oversmoothing. The significant gap between the 1-layer cases and the others in Table 16 shows that at least a 2-layer architecture is required for sufficient expressiveness.

To balance efficiency and accuracy, a suitable range for both L and K is 2 to 3. Therefore, most experiments across the various tasks and datasets in this paper use combinations within this range.

Table 16 Parameter analysis for the number of layers and diffusion depth on ogbn-arxiv

6.6 Ablation study

Nodewise, channelwise

In this paper, we introduce nodewise and channelwise diffusion weights, associated with HA and HC, respectively, as a major contribution. To demonstrate their effectiveness, we conducted an ablation study on the learning filters task (Task 1). To remove the nodewise characteristic of HA, its weights were computed from the representation averaged over all nodes, so that every node shares the same weights; to remove the channelwise characteristic of HC, a single convolution kernel was shared across all channels. Table 17 shows that channelwise weights consistently improve AGDN-HC performance. Moreover, nodewise weights dramatically enhance the ability of AGDN-HA to learn all filters, especially non-low-pass filters, overcoming the limitation discussed in Subsect. 5.2.
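The sketch below gives a minimal, simplified view of how nodewise (HA) and channelwise (HC) weights can combine the multi-hop representations; the module name `AdaptiveDiffusion`, the concatenation-based attention score, and the tensor layout are assumptions for illustration, not the exact formulation of AGDN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDiffusion(nn.Module):
    """Combines the hop representations with learnable weights.
    mode='HA': nodewise hop attention (weights differ across nodes).
    mode='HC': channelwise kernel (weights differ across channels, shared by nodes)."""
    def __init__(self, K, dim, mode="HA"):
        super().__init__()
        self.mode = mode
        if mode == "HA":
            self.att = nn.Linear(2 * dim, 1)                  # scores hop k against hop 0
        else:
            self.kernel = nn.Parameter(torch.ones(K + 1, dim) / (K + 1))

    def forward(self, hops):                                  # hops: (K+1, N, dim)
        if self.mode == "HA":
            ref = hops[0].unsqueeze(0).expand_as(hops)        # 0-hop reference per node
            scores = self.att(torch.cat([hops, ref], dim=-1)) # (K+1, N, 1)
            w = F.softmax(scores, dim=0)                      # per-node weights over hops
            return (w * hops).sum(dim=0)
        # HC: unconstrained per-channel weights over hops, shared by all nodes
        return (self.kernel.unsqueeze(1) * hops).sum(dim=0)

# toy usage: K+1 hop representations for N nodes with dim channels
K, N, dim = 3, 5, 8
hops = torch.randn(K + 1, N, dim)
out_ha = AdaptiveDiffusion(K, dim, mode="HA")(hops)           # (N, dim)
out_hc = AdaptiveDiffusion(K, dim, mode="HC")(hops)           # (N, dim)
```

Under this sketch, the ablated variants correspond to replacing each node's representation with the node-averaged one before scoring (HA) and sharing one kernel value across all channels (HC).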

Stacking multilayer architecture

We investigated the impact of the stacking multilayer architecture by implementing AGDN-HA and AGDN-HC as decoupled models, each consisting of a single AGDN layer followed by linear layers. The only hyperparameter change was setting \(K=10\) to keep the receptive fields consistent. The results in Table 17 show that AGDN-HA and AGDN-HC suffer significant performance degradation when reduced to a single layer. Stacking multiple AGDN layers with variable filters allows target information to be captured more flexibly, as discussed in Subsect. 5.3.
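Reusing the hypothetical `AdaptiveDiffusion` module from the previous sketch, the two configurations compared in this ablation can be outlined roughly as follows; the propagation operator, layer counts, and final MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

def propagate(x, adj, K):
    """Collects the 0..K-hop representations under a (normalized) adjacency;
    stands in for the message-passing step inside one AGDN layer."""
    hops = [x]
    for _ in range(K):
        hops.append(adj @ hops[-1])
    return torch.stack(hops)                            # (K+1, N, dim)

class StackedAGDN(nn.Module):
    """L stacked layers, each combining K hops: receptive field L * K."""
    def __init__(self, dim, L=3, K=4):
        super().__init__()
        self.K = K
        self.diff = nn.ModuleList([AdaptiveDiffusion(K, dim, "HA") for _ in range(L)])
        self.lins = nn.ModuleList([nn.Linear(dim, dim) for _ in range(L)])

    def forward(self, x, adj):
        for diff, lin in zip(self.diff, self.lins):
            x = torch.relu(lin(diff(propagate(x, adj, self.K))))
        return x

class DecoupledAGDN(nn.Module):
    """A single diffusion step with a larger K, followed by linear layers only."""
    def __init__(self, dim, K=10):
        super().__init__()
        self.K = K
        self.diff = AdaptiveDiffusion(K, dim, "HA")
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, adj):
        return self.mlp(self.diff(propagate(x, adj, self.K)))
```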

Positional embeddings

The importance of PEs in AGDN-HA was discussed in Corollary 6. These embeddings ensure that AGDN-HA can theoretically produce any value, which is particularly important for learning non-low-pass filters. To investigate this, we conducted an ablation study on the learning filters task. The results in Table 17 show that AGDN-HA performs significantly worse without PEs, especially on non-low-pass filters. Since AGDN-HC is already theoretically capable of learning arbitrary filters, PEs appear redundant for it; as a result, AGDN-HC performs slightly better without PEs, except for the bandpass and band-rejection filters.
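A minimal way to attach hop positional embeddings is sketched below: a learnable vector per hop index is added to the corresponding hop representations before the hop-attention scores are computed. The class name, embedding shape, and initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HopPositionalEmbedding(nn.Module):
    """One learnable embedding per hop index, added to that hop's representations
    so that hops with similar content remain distinguishable to the attention."""
    def __init__(self, K, dim):
        super().__init__()
        self.pe = nn.Parameter(torch.empty(K + 1, 1, dim))
        nn.init.normal_(self.pe, std=0.02)

    def forward(self, hops):                 # hops: (K+1, N, dim)
        return hops + self.pe                # broadcast over the node dimension

# toy usage: hop representations that are identical across hops become
# distinguishable once the positional embeddings are added
K, N, dim = 3, 5, 8
hops = torch.randn(1, N, dim).expand(K + 1, N, dim)
with_pe = HopPositionalEmbedding(K, dim)(hops)   # (K+1, N, dim)
```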

Table 17 Ablation study for nodewise weights, channelwise weights, and the stacking multilayer architecture on 50 images

AGDN versus GAT

In Table 18, we present an ablation study comparing the performance of GAT and AGDN on six OGB datasets. For ogbn-arxiv, we disabled additional techniques such as BoT, self-KD, and GIANT-XRT while keeping the settings consistent for both models. AGDN outperformed GAT on all datasets, highlighting the effectiveness of its adaptive graph diffusion.

Table 18 Ablation study on six OGB datasets

Oversmoothing

We compared AGDN variants built on various base models (GCN, GraphSAGE, GAT) with diffusion depths \(K=1,2,\ldots,8\) on the ogbn-arxiv dataset by using the official OGB implementation. As baselines, we used classical GNNs (MPNNs) with equivalent receptive fields; for example, the MPNN baseline associated with a 3-layer AGDN with \(K=4\) has \(3\times 4=12\) layers. Figure 4 shows that the MPNN baseline curves, especially for GAT, exhibit a distinct oversmoothing problem. AGDN-mean shows a curve that rises quickly but then drops significantly, indicating that it is also affected by oversmoothing. In contrast, AGDN-HA and AGDN-HC show much more stable curves. AGDN-HA attains optimal results similar to those of AGDN-mean, whereas AGDN-HC cannot outperform the shallow baseline models; because AGDN-HC directly learns unconstrained weights, it may be difficult to optimize on ogbn-arxiv. These results demonstrate that adaptive graph diffusion (HA or HC) effectively alleviates the oversmoothing problem.

Fig. 4
figure 4

Comparison among AGDN-mean, AGDN-HA, AGDN-HC, and the related base models with varying diffusion depths on ogbn-arxiv. For the MPNN baselines, the “Receptive field” is the number of layers (L). For AGDN, the “Receptive field” is the product of the number of layers and the diffusion depth (LK)

7 Discussion

In this work, we introduced adaptive graph diffusion networks (AGDNs) to tackle challenges posed by deep GNNs in numerous applications. The approach adopted by AGDNs involves learning nodewise (HA) and channelwise (HC) graph diffusion with a stacked multilayer architecture. This capability empowers them to effectively learn arbitrary filters in the spectral domain, thereby discerning intricate patterns within graph data.

Through this methodology, AGDNs show improved performance on tasks such as node classification and link prediction. This positions AGDNs as an alternative that strikes a better balance between complexity and performance than complex residual GNNs and oversimplified decoupled GNNs.

However, despite considerable progress, certain areas warrant further research. The spectral characteristics of AGDN-HA, for instance, need additional exploration. Moreover, it is essential to assess the performance of AGDNs in more real-world applications to fully understand their practical implications and limitations.

8 Conclusion

In this paper, we present AGDNs to address the challenges faced by deep GNNs in various applications. AGDNs leverage learnable nodewise (HA) and channelwise (HC) graph diffusion together with a stacking multilayer architecture, enabling them to learn arbitrary filters in the spectral domain and to capture complex patterns and relationships in graph data. This approach leads to improved performance on node classification, on both homophily-prone and heterophily-prone datasets, as well as on link prediction. AGDNs offer a better balance between complexity and accuracy than complicated residual GNNs and oversimplified decoupled GNNs, presenting a promising solution for improving GNN performance on a range of graph-based applications. The spectral characteristics of AGDN-HA remain an area for future research, and the efficiency of the hopwise attention mechanism can be further optimized. Future work will focus on enhancing the spectral interpretability and efficiency of AGDN-HA, as well as exploring more effective positional embeddings.