1 Introduction

Graph neural networks (GNNs) have achieved great success in modeling diverse graph data types, such as citation networks (Kipf and Welling 2017; Vaswani et al. 2017; Klicpera et al. 2019a), social networks (Hamilton et al. 2017; Chen et al. 2018; Rossi et al. 2020), and biological graphs (Fout et al. 2017; Li et al. 2019, 2018, 2021). Fundamental tasks addressed by GNNs include semisupervised node classification (Defferrard et al. 2016; Kipf and Welling 2017; Hamilton et al. 2017; Vaswani et al. 2017; Li et al. 2019, 2020; He et al. 2021; Li et al. 2021; Chen et al. 2020; Yu et al. 2022) and link prediction (Kipf and Welling 2016; Wang et al. 2021).

In semisupervised node classification, GNNs predict the classes of unlabeled nodes by using a limited set of labeled nodes together with neighborhood information. Link prediction, in contrast, infers missing or potential edges in incomplete graphs and can also serve as a self-supervised learning paradigm when performed with masked edges. Despite its potential, self-supervised graph learning (Kipf and Welling 2016; Pan et al. 2018; Hou et al. 2022) often falls behind supervised methods when labels are available.

GNNs often incorporate successful techniques from other areas. For instance, attention mechanisms from transformer models (Vaswani et al. 2017) are widely employed. Graph attention networks (GATs) (Velickovic et al. 2018) utilize attention scores to weigh the importance of neighbors. However, GAT still suffers from the oversmoothing issue (Li et al. 2018), a significant challenge to deep GNNs. Techniques from computer vision, such as convolution mechanisms (LeCun et al. 1998), have also been applied to GNNs for adaptive multihop feature aggregation (Zhang et al. 2018; Chien et al. 2021).

The oversmoothing problem, a common issue with deep GNNs, arises when multiple layers are stacked to capture complex neighborhood information, causing node features to become indistinguishable. This issue is often attributed to the nature of GNNs, which preserve the components associated with low eigenvalues, corresponding to smoother graph signals, while attenuating the others. Consequently, as illustrated in Fig. 1, high-frequency components are suppressed, yielding a more uniform output. Over iterations, the lowest-frequency component increasingly dominates the output, causing oversmoothing.

Fig. 1: Applying a low-pass filter (\(g(\lambda )=(1-0.5\lambda )^5\)) to a ring graph. Such a filter can be considered a deep 5-layer GNN. In the filtered graph signal, the ratio of high-frequency components becomes significantly lower

To alleviate the oversmoothing issue pervasive in deep GNNs, various solutions have been proposed. However, these techniques often introduce additional complexity, oversimplify the architecture, or rely heavily on predefined parameters. Residual GNNs (Li et al. 2018, 2019, 2020; Chen et al. 2020; Li et al. 2021) incorporate residual connections, preserving shallow information in each layer to counteract oversmoothing. However, this approach makes the layer count grow linearly with the receptive field radius, demanding a more flexible methodology to reduce redundancy and runtime.

Decoupled GNNs (Klicpera et al. 2019a; Rossi et al. 2020; Liu et al. 2020; Chien et al. 2021) employ a two-part architecture including message-passing and feature transformation to capture multihop information and to mitigate oversmoothing. However, the simplicity of this structure can limit the capacity of the model and its compatibility with other methods (Chen et al. 2020; Wang 2021; Li et al. 2021).

In contrast, diffusion-based GNNs (Defferrard et al. 2016; Klicpera et al. 2019b; Du et al. 2017; Wang et al. 2021; He et al. 2021; Bianchi et al. 2021) incorporate a graph diffusion step at each layer, which performs multiple message-passing steps and integrates the intermediate results. The graph diffusion considered in this paper is modeled directly, without probabilistic elements, and thus differs from the diffusion models in computer vision that are typically used for image noise reduction or smoothing. Despite having fewer layers, diffusion-based GNNs maintain deep receptive fields and offer better compatibility with many key GNN techniques (Chen et al. 2020; Wang 2021; Chien et al. 2022). These models may employ either explicit (Klicpera et al. 2019b) or implicit (Defferrard et al. 2016; Du et al. 2017; Wang et al. 2021; He et al. 2021) diffusion matrices, with the latter termed graph diffusion networks (GDNs). However, many GDNs utilize either predefined or naively learnable weights, which hinders their ability to capture the various patterns within deep neighborhoods.

In this paper, we introduce adaptive graph diffusion networks (AGDNs), a compact and expressive class of GNNs with large receptive fields. We note that the oversmoothing of conventional GNNs can be explained by the fact that they always act as low-pass filters in the spectral domain. The novelty of AGDNs therefore lies in their ability to learn arbitrary filters in the spectral domain, which overcomes the oversmoothing issue and further improves expressiveness. To achieve this ability with a limited number of parameters, we introduce graph diffusion with hopwise attention (HA) and hopwise convolution (HC) to adaptively gather multihop information. We also incorporate positional embeddings (PEs) to strengthen this ability for HA. We validate this ability with experiments on learning various filters on images. We also evaluate AGDNs on diverse and challenging open graph benchmark (OGB) (Hu et al. 2020) datasets under semisupervised node classification and link prediction tasks, and our results show that AGDNs outperform state-of-the-art (SOTA) GNNs while maintaining moderate complexity and runtime. The main contributions of this paper are as follows:

  • Introducing AGDNs, a compact and expressive class of GNNs with large receptive fields.

  • Addressing the oversmoothing problem with HC and with HA aided by PEs.

  • Validating the theoretical ability of AGDNs with experiments on learning various filters.

  • Outperforming SOTA GNNs on diverse OGB datasets while maintaining moderate complexity and runtime.

2 Preliminaries

In this section, we summarize the important symbols in Table 1, providing concise definitions to ensure clarity and consistency throughout the manuscript. We consider an undirected graph \({\mathcal {G}}=({\mathcal {V}}, {\mathcal {E}})\) with the node set \(\mathcal {V}\) and the edge set \({\mathcal {E}}\). We denote the number of nodes by \(N = |\mathcal {V} |\) and the number of edges by \(E = |\mathcal {E} |\). The adjacency matrix, \(\varvec{A} \in \mathbb {R}^{N \times N}\), is assumed to be non-negative and symmetric. Given the symmetry of \(\varvec{A}\), the degree matrix \(\varvec{D}\), representing both in-degrees and out-degrees, is computed by summing the elements of each row or column of the adjacency matrix.

The normalized adjacency matrix, or transition matrix, is denoted by \(\overline{\varvec{A}} \in \mathbb {R}^{N \times N}\). The graph convolution is typically represented by left-multiplication with the normalized adjacency matrix rather than the standard adjacency matrix. Normalization is achieved by scaling the matrix elements through division by the node degrees, in the form of multiplication with the inverse degree matrix \(\varvec{D}^{-1}\). Various types of normalization are summarized in Table 2. This normalization facilitates comparisons across graphs with varying sizes and degree distributions. Additionally, it ensures that the largest absolute eigenvalue of \(\overline{\varvec{A}}\) is equal to 1, preventing numerical expansion during iterative graph convolutions.
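To make the normalization concrete, the following minimal NumPy sketch computes the row (random-walk) and symmetric normalizations summarized in Table 2 for a small dense adjacency matrix; the function and variable names are illustrative and not taken from any official implementation.

```python
import numpy as np

def transition_matrices(A: np.ndarray):
    """Illustrative sketch of common adjacency normalizations (cf. Table 2).

    A is assumed to be a dense, non-negative, symmetric adjacency matrix
    without isolated nodes (isolated nodes would require special handling).
    """
    deg = A.sum(axis=1)                       # node degrees
    D_inv = np.diag(1.0 / deg)                # D^{-1}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}

    A_row = D_inv @ A                         # row (random-walk) normalization
    A_sym = D_inv_sqrt @ A @ D_inv_sqrt       # symmetric (GCN-style) normalization
    return A_row, A_sym

# Toy example: a triangle graph.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
A_row, A_sym = transition_matrices(A)
# The largest absolute eigenvalue of the normalized matrix is 1.
print(np.max(np.abs(np.linalg.eigvals(A_sym))))  # ~1.0
```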

We denote the raw node feature matrix by \({\varvec{X}}=[\varvec{x}_1,{\varvec{x}}_2,\cdots ,{\varvec{x}}_N]^{\top } \in {\mathbb R}^{N\times d^{(0)}}\), where \(d^{(0)}\) is the node feature dimension and \({\varvec{x}}_i\in {\mathbb {R}}^{d^{(0)}\times 1}\) is the feature vector of node i.

We describe a common L-layer GNN model \(f({\varvec{X}}, {\varvec{A}})\) that takes a node feature matrix and an adjacency matrix as input. This model stacks several layers \(\varvec{H}^{(l)}=g^{(l)}({\varvec{H}}^{(l-1)}, {\varvec{A}})\in \mathbb R^{N\times d^{(l)}}\) (\(1\le l\le L\)), including nonlinear activations. We denote the intermediate node representation matrices by \({\varvec{H}}^{(l)}\) (\(0\le l\le L\) and \(\varvec{H}^{(0)}={\varvec{X}}\)). \({\varvec{H}}^{(l)}\) can be written in terms of node representation vectors as \(\varvec{H}^{(l)}=\left[ {\varvec{h}}^{(l)}_{1}, \varvec{h}^{(l)}_{2},...,{\varvec{h}}^{(l)}_{N} \right] ^{\top }\) with \({\varvec{h}}_i^{(l)}\in {\mathbb {R}}^{d^{(l)}\times 1}\). The transition matrix is typically used in each layer and can be layerwise and learnable. For node classification, the softmax function is used at the output of \(f({\varvec{X}}, {\varvec{A}})\). We denote the nonlinear activation by \(\sigma\) and the LeakyReLU activation by \(\sigma _{leak}\).

2.1 Preparation for spectral analysis

The Laplacian matrix, defined as \(\varvec{L} = \varvec{D} - \varvec{A}\), is fundamental to spectral analysis, capturing graph properties through its eigenvalues and eigenvectors. In practice, the normalized Laplacian matrix \(\overline{\varvec{L}}\) is more widely used. Since it is not directly involved in iterative graph convolutions, where repeated multiplications could cause numerical expansion, its eigenvalues need not be restricted to the \([-1, 1]\) range required of the normalized adjacency matrix; instead, they are confined to [0, 2] (Chung 1996; Von Luxburg 2007), which still allows effective comparisons across graphs with different sizes and degree distributions (Chung 1996). Its computation follows that of the normalized adjacency matrix: all normalization approaches outlined in Table 2 yield the same relation between the normalized Laplacian and the normalized adjacency matrix, \(\overline{\varvec{L}} = \varvec{I} - \overline{\varvec{A}}\), where the identity matrix plays the role of the degree matrix \(\varvec{D}\) in the unnormalized definition. The associated eigendecomposition is \(\overline{{\varvec{L}}}={{\varvec{U}}} \varvec{\Lambda }{\varvec{U}}^{\top }\), where \({\varvec{U}}=[{\varvec{u}}_1, {\varvec{u}}_2,\cdots ,{\varvec{u}}_N]\) is the orthonormal eigenvector matrix and \(\varvec{\Lambda }=\text {diag}(\lambda _1, \lambda _2,\cdots ,\lambda _N)\) is the diagonal eigenvalue matrix; \({\varvec{u}}_i\) is the i-th eigenvector of \(\overline{{\varvec{L}}}\), and \(\lambda _i\in [0,2]\) is the corresponding eigenvalue. A graph signal \({\varvec{X}}\in \mathbb R^{N\times d}\) is transformed into the spectral domain by \(\widetilde{{\varvec{X}}}={\varvec{U}}^{\top }{\varvec{X}}\) and recovered by \({\varvec{X}}={\varvec{U}} \widetilde{{\varvec{X}}}\). Diffusion-based GNNs and some decoupled GNNs can be viewed as polynomial filters in the spectral domain. For example, \(f({\varvec{X}};\overline{\varvec{L}})=\sum _{k=0}^{K}\theta _{k}\overline{{\varvec{L}}}^k \varvec{X}\) can be written as \({\varvec{U}} g(\varvec{\Lambda })\widetilde{ {\varvec{X}}}\), where \(g(\varvec{\Lambda })=\sum _{k=0}^{K}\theta _{k}\varvec{\Lambda }^k\) and the coefficients \(\theta _k\) can be predefined or learnable. Alternatively, the form \(g(\varvec{\Lambda })=\sum _{k=0}^{K}\theta _{k}(\varvec{I}-\varvec{\Lambda })^k\) is more typically used, which matches the form of graph diffusion. The graph signal of the c-th channel, \({\varvec{X}}_{:,c}\), can be decomposed into frequency components as \({\varvec{X}}_{:,c}=\sum _{i=1}^{N} \alpha _{i,c} {{\varvec{u}}}_i\) with coefficients \(\alpha _{i,c}\); since the bases \({{\varvec{u}}}_i\) are orthonormal, \(\widetilde{X}_{i,c}=\alpha _{i,c}\).
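As a sanity check on the correspondence between the spatial and spectral views, the following NumPy sketch applies a polynomial filter \(g(\lambda )=\sum _k\theta _k(1-\lambda )^k\) via an explicit eigendecomposition and confirms that it matches the spatial-domain diffusion \(\sum _k\theta _k\overline{\varvec{A}}^k\varvec{X}\) on a small random graph; the dense eigendecomposition and the toy graph are for illustration only.

```python
import numpy as np

def polynomial_filter(A_norm: np.ndarray, X: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Apply U g(Λ) Uᵀ X with g(λ) = Σ_k θ_k (1 - λ)^k, where Λ, U come from
    the normalized Laplacian L̄ = I - Ā. The dense eigendecomposition is only
    for illustration; it is never needed for actual message passing."""
    N = A_norm.shape[0]
    L_norm = np.eye(N) - A_norm                  # normalized Laplacian
    lam, U = np.linalg.eigh(L_norm)              # eigenvalues lie in [0, 2]
    X_tilde = U.T @ X                            # graph Fourier transform
    g_lam = sum(t * (1.0 - lam) ** k for k, t in enumerate(theta))
    return U @ (g_lam[:, None] * X_tilde)        # filter, then inverse transform

# Sanity check against the spatial-domain diffusion Σ_k θ_k Ā^k X.
rng = np.random.default_rng(0)
A = np.triu(rng.integers(0, 2, size=(6, 6)).astype(float), 1)
A = A + A.T                                      # random symmetric adjacency
A = A + np.diag((A.sum(1) == 0).astype(float))   # self-loop for isolated nodes
d = A.sum(1)
A_norm = A / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
X = rng.normal(size=(6, 3))
theta = np.array([0.5, 0.3, 0.2])
spatial = sum(t * np.linalg.matrix_power(A_norm, k) @ X for k, t in enumerate(theta))
assert np.allclose(polynomial_filter(A_norm, X, theta), spatial)
```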

Table 1 Symbol definitions

3 Related works

3.1 Attention mechanisms

The attention mechanism in transformers (Vaswani et al. 2017) computes the attention scores as the dot product of query and key vectors derived from input embeddings:

$$\begin{aligned} \text {Attention}(\varvec{Q}, \varvec{K}, \varvec{V}) = \text {softmax}\left( \frac{\varvec{Q}\varvec{K}^T}{\sqrt{d}}\right) \varvec{V}, \end{aligned}$$
(1)

where \(\varvec{K}^T\) denotes the transpose of the key matrix \(\varvec{K}\) and d is the key dimension. In graph attention networks (GATs) (Velickovic et al. 2018), the attention score between a node i and its neighbor j is computed by taking the dot product between a query vector \(\varvec{a}\) and the concatenation of their transformed feature vectors and then normalizing over the neighbors via softmax:

$$\begin{aligned} \overline{A}_{ij} = \frac{\mathop {exp}\left( \sigma _{leak}\left( [\varvec{W}\varvec{h}_i ||\varvec{W}\varvec{h}_j]\cdot \varvec{a}\right) \right) }{\sum _{j' \in \mathcal {N}_i}\mathop {exp}\left( \sigma _{leak}\left( [\varvec{W}\varvec{h}_i ||\varvec{W}\varvec{h}_{j'}]\cdot \varvec{a}\right) \right) }, \end{aligned}$$
(2)

where \(\cdot\) represents the dot product and the LeakyReLU activation \(\sigma _{leak}\) is applied to introduce nonlinearity in the attention mechanism.

Both mechanisms assign different weights to inputs based on their relevance in a given context. GAT is more flexible than GCN, but in some cases it behaves like GCN (NT and Maehara 2019), and it still suffers from the oversmoothing issue. Moreover, GAT utilizes only a breadthwise attention mechanism. In this paper, we propose HA as a depthwise attention mechanism that weighs the importance of different hops and overcomes the oversmoothing issue.

3.2 Diffusion-based GNNs

Let us suppose that \(\mathcal G\) is an undirected graph with a normalized adjacency matrix (transition matrix) \(\overline{{\varvec{A}}}\) and node feature matrix \({\varvec{X}}\). Graph diffusion (Klicpera et al. 2019a, b; Page et al. 1999; Kondor and Lafferty 2002) can be described as:

$$\begin{aligned} f(\varvec{X},\overline{\varvec{A}})=\sum _{k=0}^{K}\theta _{k}\overline{\varvec{A}}^{k}{\varvec{X}}, \end{aligned}$$
(3)

with l1-normalized nonnegative weights \({\theta _{k}}\) and maximum diffusion depth K.
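The sum in Eq. 3 can be computed implicitly without ever materializing the powers \(\overline{\varvec{A}}^{k}\), which underlies the memory efficiency of GDNs discussed below. A minimal Python sketch of this iterative computation (illustrative, not a specific library API) is:

```python
import numpy as np

def graph_diffusion(A_norm, X, theta):
    """Implicit graph diffusion Σ_{k=0}^{K} θ_k Ā^k X (Eq. 3).

    Only the running k-hop signal Ā^k X is kept in memory; the powers Ā^k
    themselves are never formed. `A_norm` can be any object supporting `@`
    (a dense array or a scipy sparse matrix)."""
    out = theta[0] * X
    H = X
    for k in range(1, len(theta)):
        H = A_norm @ H               # one more message-passing (hop) step
        out = out + theta[k] * H
    return out
```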

Diffusion-based GNNs enlarge receptive fields by performing multiple message-passing steps per layer while using fewer layers. The graph diffusion convolution (GDC) (Klicpera et al. 2019b) method computes an explicit diffusion matrix during preprocessing and the diffused features during training and inference. However, GDC faces limitations in scalability, flexibility, and link prediction performance. Graph diffusion networks (GDNs) address these limitations by using implicit, memory-efficient graph diffusion. They iteratively calculate multihop features and aggregate them during training and inference, resulting in smaller memory requirements.

Spectral GNNs, which include diffusion-based GNNs and some decoupled GNNs (Defferrard et al. 2016; Klicpera et al. 2019a; Chen et al. 2020), employ graph diffusion as polynomial filters in the spectral domain. Using the preparation in Subsect. 2.1, graph diffusion can be written in the spectral domain as:

$$\begin{aligned} f(\varvec{X},\overline{\varvec{A}}) = \varvec{U}g(\varvec{\Lambda })\widetilde{\varvec{X}}, \end{aligned}$$
(4)
$$\begin{aligned} g(\varvec{\Lambda })=\sum _{k=0}^{K}\theta _{k}(\varvec{I}-\varvec{\Lambda })^{k}. \end{aligned}$$
(5)

Thus, graph diffusion can be interpreted as a polynomial filter in the spectral domain. For comparison, ignoring feature transformations, multiple GCN layers are a special case of graph diffusion in which only the K-th-order coefficient \(\theta _{K}\) is nonzero and equal to 1. Due to the approximation power of polynomials, with a sufficiently large order K and suitable coefficients, graph diffusion can approximate any filter in the spectral domain. However, many existing GDNs, such as the topology adaptive graph convolutional network (TAGCN) (Du et al. 2017) and the multihop attention graph neural network (MAGNA) (Wang et al. 2021), utilize predefined coefficients. ChebNet (Defferrard et al. 2016), ARMA (Bianchi et al. 2021) and BernNet (He et al. 2021) learn coefficients naively over complicated polynomial bases or forms. Our proposed AGDNs offer more flexible coefficients and can learn arbitrary filters even with the simplest polynomial bases, enhancing both node classification and link prediction performance.

3.3 Other deep GNNs

Residual GNNs such as the jumping-knowledge network (JKNet) (Xu et al. 2018), GCNII (Chen et al. 2020), DeepGCN (Li et al. 2019), DeeperGCN (Li et al. 2020), and RevGNNs (Li et al. 2021) preserve shallow information by using residual connections. DGCNN (Zhang et al. 2018) utilizes learnable convolution kernels to combine intermediate representations. Despite their excellent performance, these GNNs often involve large-scale architectures, leading to extended runtimes. Alternatively, our compact AGDNs with large receptive fields offer competitive performance with fewer parameters and less runtime. Residual connections can also be incorporated into AGDNs.

Decoupled GNNs such as DCNN (Defferrard et al. 2016), SGC (Wu et al. 2019), SIGN (Rossi et al. 2020), APPNP (Klicpera et al. 2019a), GPR-GNN (Chien et al. 2021), and DAGNN (Liu et al. 2020) achieve deep receptive fields with lower complexity by separating the message-passing and feature transformation steps. Despite addressing oversmoothing issues, their decoupled architecture limits model capacity and poses compatibility issues with certain GNN techniques. In contrast, AGDNs manage complexity effectively while maintaining the stacked multilayer architecture characteristic of GNNs, thereby avoiding these limitations.

4 Proposed methods

In this section, we first introduce the framework of AGDNs and propose a novel symmetric graph attention network (GAT) (Velickovic et al. 2018) transition matrix. Then, we propose two scalable mechanisms, together with positional embeddings, for combining multihop information.

4.1 Adaptive graph diffusion networks

For the following discussion, we omit the detail that the last (L-th) layer does not include a nonlinear activation. A common L-layer GNN model follows a stacked multilayer architecture: \(f({\varvec{X}}, \varvec{A})=f^{(L)}(...f^{(2)}(f^{(1)}({\varvec{X}},\varvec{A}),{\varvec{A}})..., {\varvec{A}})\).

Fig. 2: Model architecture of an AGDN model. a We preserve the stacking architecture of common GNNs in AGDNs. b In each AGDN layer, we perform feature transformation and adaptive graph diffusion. c We utilize generalized weights for adaptive graph diffusion. The operator \(\otimes\) represents matrix multiplication, the bold operator \(\varvec{\odot }\) represents elementwise multiplication, and the operator \(\oplus\) represents summation

For an AGDN model, the l-th layer, which outputs the l-th node representation matrix \(\varvec{H}^{(l)}\), can be described as follows:

$$\begin{aligned} \varvec{H}^{(l)}=f^{(l)}({\varvec{H}}^{(l-1)}, \varvec{A})=\sigma \left( \sum ^{K}_{k=0}{\varvec{\Theta }}^{(l)}_{:,k,:}\odot (\overline{\varvec{A}}^{(l),k}{\varvec{H}}^{(l-1)}{\varvec{W}}^{(l)})\right) , \end{aligned}$$
(6)

where \(\odot\) refers to elementwise multiplication (the Hadamard product). We illustrate the overall architecture of an AGDN model in Fig. 2. To enhance the flexibility and expressiveness of GDNs, we introduce variable weights for different nodes and feature channels. These weights are represented by a 3-dimensional weighting tensor \(\varvec{\Theta }^{(l)}\in \mathbb R^{N\times (K+1)\times d^{(l)}}\). Instead of directly multiplying the representation matrix by a scalar \(\theta _{k}^{(l)}\), the associated element of a weighting vector \(\varvec{\theta }^{(l)} \in {\mathbb {R}}^{K+1}\), we perform an elementwise multiplication between the representation matrix and a matrix \(\varvec{\Theta }_{:,k,:}^{(l)}\in {\mathbb {R}}^{N\times d^{(l)}}\), the associated slice of the weighting tensor \(\varvec{\Theta }^{(l)}\). A larger element \({\Theta }^{(l)}_{i,k,c}\) indicates that the k-th hop representation contributes more to the output representation of node i in the c-th channel. GDNs are a special case of AGDNs in which, for each hop k, all elements in \(\varvec{\Theta }^{(l)}_{:,k,:}\) are identical. GNNs following the MPNN framework are a special case of AGDNs in which, for every layer l, all elements in \(\varvec{\Theta }^{(l)}_{:,k,:}\) are 1 for \(k=1\) and 0 for \(k\ne 1\).
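The following PyTorch sketch illustrates one AGDN layer as written in Eq. 6, assuming a dense transition matrix and taking the weighting tensor \(\varvec{\Theta }\) as an explicit argument; in the actual model, \(\varvec{\Theta }\) is produced by HA or HC, introduced below, and ReLU stands in for the generic activation \(\sigma\). All names are illustrative.

```python
import torch
import torch.nn as nn

class AGDNLayer(nn.Module):
    """Minimal sketch of one AGDN layer (Eq. 6) with a dense transition matrix
    and a caller-supplied weighting tensor Theta of shape N × (K+1) × d."""
    def __init__(self, in_dim: int, out_dim: int, K: int):
        super().__init__()
        self.K = K
        self.lin = nn.Linear(in_dim, out_dim, bias=False)   # W^{(l)}

    def forward(self, H, A_norm, Theta):
        # 0-th hop: feature transformation only.
        hops = [self.lin(H)]                                 # Ĥ^{(l,0)}
        for _ in range(self.K):
            hops.append(A_norm @ hops[-1])                   # Ĥ^{(l,k)} = Ā Ĥ^{(l,k-1)}
        Hk = torch.stack(hops, dim=1)                        # N × (K+1) × d
        return torch.relu((Theta * Hk).sum(dim=1))           # Σ_k Θ_{:,k,:} ⊙ Ĥ^{(l,k)}
```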

We omit the computation of the transition matrix \(\overline{\varvec{A}}^{(l)}\) in Eq. 6. The superscript (l) is necessary when we utilize the GAT transition matrix or its variant. We summarize the common transition matrices in Table 2. The transition matrix used in graph diffusion is the normalized adjacency matrix, which is also used in graph convolution. Since the symmetric GCN (Kipf and Welling 2017) transition matrix has been shown to be more effective on specific datasets, we propose a symmetric variant of the GAT (Velickovic et al. 2018) transition matrix to improve performance on these datasets, denoted as \(\overline{\varvec{A}}_{\text {gat.sym}}\) in Table 2.

Table 2 Transition matrices

4.2 Learning the weighting tensor

The critical issue is to obtain suitable weights. To simplify the discussion, we reformulate the l-th AGDN layer, previously described in Eq. 6, from both the channel and node viewpoints.

First, associated with the right-hand side of \(\odot\) in Eq. 6, we iteratively compute the multihop node representation vectors for node i at the l-th layer as follows:

$$\begin{aligned} \hat{\varvec{h}}^{(l,k)}_{i}= {\left\{ \begin{array}{ll} {{\varvec{W}}^{(l)}}^{\top }{\varvec{h}}_i^{(l-1)}, \text {if } k=0; \\ \sum _{j\in \mathcal {N}_i}\overline{A}^{(l)}_{i,j}\hat{\varvec{h}}^{(l,k-1)}_j, \text {if } 0<k\le K, \end{array}\right. } \end{aligned}$$
(7)

where the initial 0-th hop representation vector, \(\hat{\varvec{h}}_{i}^{(l,0)}\), is generated by applying the l-th layer’s linear transformation matrix, \(\varvec{W}^{(l)}\), to the previous layer’s representation \(\varvec{h}_i^{(l-1)}\) of node i.

Then, associated with the elementwise multiplication and summation in Eq. 6, we aggregate the multihop node representations \(\hat{\varvec{h}}_i^{(l,k)}\) with nodewise and channelwise weights \(\Theta _{i,k,c}^{(l)}\) from the weighting tensor \(\varvec{\Theta }^{(l)}\):

$$\begin{aligned} h_{i,c}^{(l)}=\sigma \left( \sum _{k=0}^{K}\Theta ^{(l)}_{i,k,c}\hat{h}^{(l,k)}_{i,c}\right) , \end{aligned}$$
(8)

where \(h_{i,c}^{(l)}\), associated with node i and the c-th channel, corresponds to the (i, c) entry of the node representation matrix \(\varvec{H}^{(l)}\) at the l-th layer, defined in Eq. 6.

Learning an adaptive weighting tensor directly can be challenging, as it leads to a significant increase in the number of parameters. To address this issue, we introduce hopwise convolution (HC) and hopwise attention (HA) as methods to obtain an adaptive weighting tensor with a reasonable number of parameters. By applying different weights, we obtain three variants of AGDN: AGDN-mean, AGDN-HA, and AGDN-HC. AGDN-mean utilizes uniform weights and can be viewed as a GDN.

Fig. 3: HA and HC. We omit the superscript (l). HC (left): the weighting tensor of hopwise convolution \(\varvec{\Theta }\) is directly derived from the \((K+1)\times d\) kernel matrix \(\varvec{\Theta }^{\text {HC}}\) by repeating it N times along the first dimension. HA (right): the hopwise attention is parameterized by \({\varvec{a}}^{\text {HA}}\) with a LeakyReLU activation function \(\sigma _{leak}\) and normalized along hops with the softmax function. The associated weighting tensor \(\varvec{\Theta }\) can be derived from the \(N\times (K+1)\) matrix \(\varvec{\Theta }^{\text {HA}}\) by repeating it d times along the third dimension

4.2.1 Hopwise convolution

We propose a simple mechanism called hopwise convolution (HC) that utilizes only \((K+1)\times d^{(l)}\) parameters. Channelwise filters in the spectral domain are important for producing multidimensional predictions (Wang and Zhang 2022). Thus, to minimize the number of parameters, we define a learnable weighting tensor with the constraint \({\Theta }_{i,k,c}^{(l)}={\Theta }_{1,k,c}^{(l)}\) for all i, i.e., all nodes share the same weights. This allows us to simplify the weighting tensor into a 2-dimensional matrix \(\varvec{\Theta }^{\text {HC},(l)}\in {\mathbb {R}}^{(K+1)\times d^{(l)}}\) by dropping the first subscript i. We can recover the complete weighting tensor \(\varvec{\Theta }^{(l)}\) by adding the first dimension and repeating \(\varvec{\Theta }^{\text {HC},(l)}\) N times along it, as illustrated in the left part of Fig. 3. In short, the output of AGDN-HC can be formulated as follows:

$$\begin{aligned} h^{(l)}_{i,c}=\sigma \left( \sum _{k=0}^{K}\Theta ^{\text {HC},(l)}_{k,c}\hat{h}^{(l,k)}_{i,c}\right) , \end{aligned}$$
(9)

where we change Eq. 8 by replacing \(\Theta _{i,k,c}\) with the elements \([\Theta _{k,c}^{HC,(l)}]_{0\le k\le K, 1\le c\le d^{(l)}}\) of the HC kernel \(\varvec{\Theta }^{HC,(l)}\). Consequently, in Eq. 9, the multihop representation vectors’ c-th channels \([\hat{h}_{i,c}^{(l,k)}]_{0\le k\le K}\) are aggregated with learnable channelwise weights \([\Theta _{k,c}^{HC,(l)}]_{0\le k\le K}\), which are directly optimized during training.

Moreover, similar to CNN kernels (Krizhevsky et al. 2012), HC uses different kernels for different channels. However, an HC kernel does not slide along the hops; it computes a single output that combines all hops.
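A minimal PyTorch sketch of the HC weighting in Eq. 9 is given below; the single \((K+1)\times d\) kernel is broadcast over the node dimension, which is equivalent to repeating it N times, and the initialization shown is illustrative rather than prescribed.

```python
import torch
import torch.nn as nn

class HopwiseConvolution(nn.Module):
    """Sketch of HC (Eq. 9): one learnable (K+1) × d kernel shared by all nodes."""
    def __init__(self, K: int, dim: int):
        super().__init__()
        self.kernel = nn.Parameter(torch.randn(K + 1, dim) / (K + 1))  # Θ^{HC}

    def forward(self, Hk):             # Hk: N × (K+1) × d multihop representations
        # Broadcasting repeats the kernel along the node dimension.
        return (self.kernel.unsqueeze(0) * Hk).sum(dim=1)              # N × d
```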

4.2.2 Hopwise attention

HC still learns hop weights directly, which may be sensitive to initialization and difficult to optimize. Thus, inspired by the attention mechanism in GAT (Velickovic et al. 2018), we propose hopwise attention (HA) to induce hop weights with \(2d^{(l)}\) parameters from a learnable query vector \({\varvec{a}}^{\text {HA},(l)}\in {\mathbb {R}}^{2d^{(l)}}\). We suppose that hop weights should be node-specific and normalized: \(\sum _{k=0}^{K}{\Theta }_{i,k,c}^{(l)}=1, \forall i,c\). For convenience, we ignore the differences along channels: \({\Theta }_{i,k,c}^{(l)}={\Theta }_{i,k,1}^{(l)}, \forall c\). We simplify the unified weighting tensor into a 2-dimensional weighting matrix \(\varvec{\Theta }^{\text {HA},(l)}\in {\mathbb {R}}^{N\times (K+1)}\) by dropping the last subscript c. \(\varvec{\Theta }^{(l)}\) can be recovered by adding the third dimension and repeating \(\varvec{\Theta }^{\text {HA},(l)}\) \(d^{(l)}\) times along it, as shown in the right part of Fig. 3.

The HA weighting tensor is calculated as follows:

$$\begin{aligned} {\Theta }^{\text {HA},(l)}_{i,k}=\frac{\textrm{exp}\left( \sigma _{leak}\left( \left[ \hat{{\varvec{h}}}^{(l,0)}_{i}\left|\right|\hat{{\varvec{h}}}^{(l,k)}_{i}\right] \cdot {\varvec{a}}^{\text {HA},(l)}\right) \right) }{\sum _{k'=0}^{K} {\textrm{exp}\left( \sigma _{leak}\left( \left[ \hat{{\varvec{h}}}^{(l,0)}_{i}\left|\right|\hat{{\varvec{h}}}^{(l,k')}_{i}\right] \cdot {\varvec{a}}^{\text {HA},(l)}\right) \right) }}, \end{aligned}$$
(10)

where \(\cdot\) represents the inner product, \(||\) represents the concatenation operation, k indexes the hop, and \(\varvec{a}^{\text {HA},(l)}\) represents the query vector of HA. Finally, we formulate the output of AGDN-HA as follows:

$$\begin{aligned} h^{(l)}_{i,c}=\sigma \left( \sum _{k=0}^{K}\Theta ^{\text {HA},(l)}_{i,k}\hat{h}^{(l,k)}_{i,c}\right) . \end{aligned}$$
(11)

The form of HA is similar to that of GAT (Velickovic et al. 2018). However, there are significant differences between them. First, GAT calculates the edge scores used in the message-passing step, whereas HA calculates the weights used for combining multihop information in the graph diffusion step. Second, the edge scores of GAT are normalized over the 1st-hop neighborhood of the destination node, whereas the weights of HA are normalized over the multihop smoothed representations of the target node. In short, HA is depthwise, while GAT is breadthwise.
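A minimal PyTorch sketch of HA (Eqs. 10 and 11) is shown below; it assumes the multihop representations are already stacked into an \(N\times (K+1)\times d\) tensor, and the query-vector initialization is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HopwiseAttention(nn.Module):
    """Sketch of HA (Eqs. 10-11): node-specific hop weights from a query vector
    a^{HA} ∈ R^{2d}, normalized over hops with softmax."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.empty(2 * dim))    # a^{HA,(l)}
        nn.init.normal_(self.query, std=0.1)                # illustrative init

    def forward(self, Hk):                                  # Hk: N × (K+1) × d
        h0 = Hk[:, :1, :].expand_as(Hk)                     # 0-th hop repeated per k
        scores = torch.cat([h0, Hk], dim=-1) @ self.query   # [Ĥ^{(l,0)} || Ĥ^{(l,k)}] · a
        weights = F.softmax(F.leaky_relu(scores), dim=1)    # Θ^{HA}: N × (K+1)
        return (weights.unsqueeze(-1) * Hk).sum(dim=1)      # N × d
```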

Positional embedding

To enhance the position (hop) information, we can add learnable positional embeddings (PEs) \({\varvec{p}}^{(k)}\in {\mathbb {R}}^{d^{(l)}\times 1}\), \(k\in \{0,1,\cdots ,K\}\), to the multihop representation vectors. Such PEs are identical for all nodes. PEs can encode additional hop positional information and theoretically allow AGDN with HA to achieve better universality, enabling it to learn arbitrary filters. We discuss the role of PEs in Subsect. 5.2.
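A hedged sketch of this step: the same learnable embedding \({\varvec{p}}^{(k)}\) is added to the k-th hop representation of every node before the HA weights are computed (the tensor sizes below are hypothetical).

```python
import torch
import torch.nn as nn

N, K, dim = 5, 3, 8                           # hypothetical sizes
Hk = torch.randn(N, K + 1, dim)               # multihop representations Ĥ^{(l,k)}
pe = nn.Parameter(torch.zeros(K + 1, dim))    # learnable p^{(0)}, ..., p^{(K)}
Hk = Hk + pe.unsqueeze(0)                     # the same embeddings are added for every node
```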

5 Model analysis

In this section, we begin by discussing the expressiveness of AGDN-HC and AGDN-HA. We provide simple spectral properties of AGDN-HC, and then, we present the universality theorem of AGDN-HA from a spatial perspective. We explain why stacking multilayer architecture is necessary for better capturing diverse information. Finally, we provide the time and space complexity of AGDNs.

5.1 Spectral properties of AGDN-HC

To simplify our analysis, we first consider the constrained condition where all nodes are assigned the same hop weights. This allows us to establish a lower bound on the expressiveness of AGDNs in general. First, we only consider 1-dimensional output. An AGDN layer applied on a graph signal \({\varvec{X}}\in {\mathbb {R}}^{N\times d}\) is described as follows:

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K}\Theta ^{\text {HC}}_{k} {\overline{{\varvec{A}}}}^{k}{\varvec{X}}&= {\varvec{U}}\left( \sum _{k=0}^{K}\Theta ^{\text {HC}}_{k}({\varvec{I}} - {\varvec{\Lambda }})^{k}\right) {\varvec{U}}^{\top }{\varvec{X}} \\&={\varvec{U}} g({\varvec{\Lambda }})({{\varvec{U}}}^{\top }{\varvec{X}}), \end{aligned} \end{aligned}$$
(12)

where \(g:[0,2]\rightarrow {\mathbb {R}}\) denotes a spectral filter and \(g({\varvec{\Lambda }})\) applies \(g(\lambda )=\sum _{k=0}^{K}\Theta ^{\text {HC}}_{k}(1-\lambda )^k\) elementwise to the diagonal entries of \(\varvec{\Lambda }\). The l-th AGDN layer acts as a polynomial filter \(g^{(l)}\) in the spectral domain, with coefficients given by the hop weights. Ignoring the differences in transition matrices, nonlinear activations, and feature transformations, the overall AGDN model is a nested polynomial filter \(g(\lambda )=g^{(L)}(\cdots g^{(2)}(g^{(1)}(\lambda ))\cdots )\). For the multidimensional output of the channelwise AGDN-HC, we can add the subscript c to \(\Theta ^{\text {HC}}_{k}\) and g in Eq. 12.

Theorem 1

(Weierstrass Approximation Theorem). Suppose h is a continuous real-valued function defined on the real interval [a, b]. For every \(\epsilon >0\), there exists a polynomial p such that for all x in [a, b], we have \(|h(x)-p(x)|< \epsilon\).

Lemma 5.1

For a connected graph \(\mathcal G\) with normalized Laplacian matrix \(\overline{{\varvec{L}}}\) whose eigenvalues are \(0=\lambda _1 \le \lambda _2\le ... \le \lambda _N\le 2\), ignoring feature transformation and nonlinear activation, an AGDN-HC layer f can be viewed as a polynomial filter with learnable coefficients \(g(\lambda )=\sum _{k=0}^{K}\Theta ^{HC}_{k}(1-\lambda )^k\) in the spectral domain. Suppose p is a polynomial defined on [0, 2]. For every \(\epsilon >0\), there exists an AGDN-HC layer f acting as a polynomial filter g such that for all \(\lambda\) in [0, 2], we have \(|g(\lambda )-p(\lambda )|<\epsilon\).

Based on Theorem 1 and Lemma 5.1, we can easily obtain the following theorem:

Theorem 2

For a connected graph \(\mathcal G\) with normalized Laplacian matrix \(\overline{{\varvec{L}}}\) whose eigenvalues are \(0=\lambda _1 \le \lambda _2\le ... \le \lambda _N\le 2\), ignoring feature transformation and nonlinear activation, an AGDN-HC layer f can approximate any continuous real-valued function h defined on the real interval [0, 2] with a sufficiently large K.

However, as discussed in (Wang and Zhang 2022), to generate arbitrary outputs, we must add a feature transformation and consider the characteristics of the eigenvalues and of the given input graph signal. The sufficient conditions are expressed in the following universality theorem:

Theorem 3

(Wang and Zhang 2022). An AGDN-HC layer with a feature transformation, or any other spectral GNN that can approximate arbitrary spectral filters, can produce any 1-dimensional prediction if \(\overline{{\varvec{L}}}\) has no repeated eigenvalues and the node feature \({\varvec{X}}\in {\mathbb {R}}^{N\times d}\) contains all frequency components (\(\alpha _i\ne 0,\ \forall i\in \{1,2,\cdots ,N\}\)).

The two conditions can be easily satisfied or approximately satisfied in real-world datasets. The channelwise filters can further ensure the production of arbitrary multidimensional predictions.

The above theorems are also applicable to several spectral GNNs (Defferrard et al. 2016; Chien et al. 2021; Bianchi et al. 2021; He et al. 2021; Wang and Zhang 2022), which can be regarded as polynomial filters with learnable coefficients. However, in many cases, it can be challenging to directly learn coefficients and to achieve optimal performance. Some spectral GNNs (Defferrard et al. 2016; He et al. 2021; Wang and Zhang 2022) utilize special polynomial bases to facilitate optimization. In this paper, we propose stacking multiple AGDN-HC layers and using channelwise filters to increase the ability to learn arbitrary filters. The channelwise filters are crucial when separating and preserving different frequency components of the same input graph signal. We demonstrate that even using the simplest form, the AGDN-HC can still significantly outperform most other spectral GNNs in learning all example filters.

5.2 AGDN-HA

AGDN-HA uses nonnegative and normalized hop weights, while AGDN-HC uses arbitrary hop weights. HA provides effective control over the magnitude of combined diffusion results, making it less sensitive to initialization. Additionally, HA is based on input representations and has better generalizability. The nodewise weights are critical for the AGDN-HA to learn various filters. First, we still consider the condition that all nodes share the same HA weights. Then, we can obtain exactly the same form as Eq. 12 but with different constraints on hop weights: \(\Theta ^{\text {HA}}_{k}\ge 0\) and \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{k}=1\). With these constraints, AGDN-HA can learn only low-pass filters.

Theorem 4

(Chien et al. 2021) For a connected graph \(\mathcal G\) with normalized Laplacian matrix \(\overline{{\varvec{L}}}\) whose eigenvalues are \(0=\lambda _1 \le \lambda _2\le ... \le \lambda _N\le 2\), ignoring feature transformation and nonlinear activation, a constrained AGDN-HA layer f can be viewed as a polynomial filter \(g(\lambda )=\sum _{k=0}^{K}\Theta ^{\text {HA}}_{k}(1-\lambda )^k\) in the spectral domain with nonnegative and l1-normalized learnable coefficients: \(\Theta ^{\text {HA}}_{k}\ge 0,\ \forall k\in \{0,1,\cdots ,K\}\), and \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{k}=1\). If \(\exists k^{\prime }>0\) such that \(\Theta ^{\text {HA}}_{k^{\prime }}>0\), then g can only learn a low-pass filter and \(|g(\lambda _1)|> |g(\lambda _i)|,\ \forall i\ge 2\).

The proof can be found in (Chien et al. 2021).

In AGDN-HA, the use of nodewise weights is crucial in overcoming the limitation discussed in Theorem 4. However, since the nodewise weights can easily change eigenvectors, the spectral analysis for AGDN-HA is very difficult. Thus, we directly obtain AGDN-HA’s universality for multidimensional cases from a spatial perspective by using a strong condition, which includes the ability to learn arbitrary filters. To clarify, by using the node viewpoint form in Eqs. 8 and 11, we define the multihop node feature vectors for node i as \(\hat{{\varvec{x}}}^{(k)}_i=\sum _{j\in \mathcal {N}_i}\overline{A}_{i,j}\hat{{\varvec{x}}}^{(k-1)}_j\) in \({\mathbb {R}}^{d\times 1}\) (\(\hat{{\varvec{x}}}^{(0)}_i={\varvec{x}}_i\)). We find that for each node i, all possible values of its associated output vector \(\overline{{\varvec{x}}}_i=\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}\hat{{\varvec{x}}}^{(k)}_i\) constitute the convex hull \(Q_{i}\) of its multihop feature vectors \(\{\hat{{\varvec{x}}}^{(0)}_{i}, \hat{{\varvec{x}}}^{(1)}_{i},\cdots ,\hat{{\varvec{x}}}^{(K)}_{i}\}\), since \(\forall i\), \(\Theta ^{\text {HA}}_{i,k}\ge 0\) and \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}=1\). As HA weights are node specific, all possible values of the set \(\{\overline{{\varvec{x}}}_1,\overline{{\varvec{x}}}_2,\cdots ,\overline{{\varvec{x}}}_N\}\) constitute the set \(Q_1\times Q_2\cdots \times Q_N\).

Proposition 5

An AGDN-HA layer with a learnable scaling factor w can produce any d-dimensional prediction \({\varvec{Z}}\in {\mathbb {R}}^{N\times d}\) if for every node i, there exists a d-dimensional ball centered at the origin belonging to the convex hull of its multihop feature vectors given a d-dimensional node feature input \({\varvec{X}}\in {\mathbb {R}}^{N\times d}\).

Proof

With learnable w, for each node i, its output vector is defined as \({{\varvec{x}}}^{\prime }_i=w\overline{{\varvec{x}}}_i=w\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}\hat{{\varvec{x}}}^{(k)}_i\). The target vector for node i is denoted by \({\varvec{z}}_i\in {\mathbb {R}}^{d\times 1}\) and \({\varvec{Z}}=[{\varvec{z}}_1, {\varvec{z}}_2,\cdots ,{\varvec{z}}_N]^{\top }\). The condition that there exists a d-dimensional ball centered at the origin belonging to \(Q_i\) is equivalent to \(\forall i, \exists \epsilon _i>0,\{\overline{{\varvec{x}}}_{i}\mid ||\overline{{\varvec{x}}}_{i}||_2<\epsilon _i\}\subset Q_{i}\). The proposition reduces to proving that \(\forall {\varvec{Z}}\in {\mathbb {R}}^{N\times d}, \exists \{\overline{{\varvec{x}}}_1,\overline{{\varvec{x}}}_2,\cdots ,\overline{{\varvec{x}}}_N\}\in Q_1\times Q_2\cdots \times Q_N\) such that \(\forall i, w\overline{{\varvec{x}}}_i={{\varvec{z}}}_{i}\). For any target prediction \({\varvec{Z}}\), by setting \(w>\sup _{i}{\frac{||{\varvec{z}}_{i}||_2}{\epsilon _i}}\), we can simply construct \(\overline{{\varvec{x}}}_i=\frac{1}{w}{{\varvec{z}}}_{i}\) belonging to \(Q_i\) because \(||\overline{{\varvec{x}}}_i||_2=\frac{1}{w}||{\varvec{z}}_i||_2<\epsilon _i\). In short, by setting a sufficiently large scaling factor w, we can always find a feasible set \(\{\overline{{\varvec{x}}}_1,\overline{{\varvec{x}}}_2,\cdots ,\overline{{\varvec{x}}}_N\}\) for producing any target prediction. Therefore, the proof is complete. \(\square\)

In practical datasets, it may be difficult to satisfy the strong condition of this theorem. Since we need to construct a d-dimensional convex hull, we should set \(K\ge d\) and ensure that for every node i, its multihop matrix \([\hat{{\varvec{x}}}^{(0)}_i,\hat{{\varvec{x}}}^{(1)}_i,\cdots ,\hat{{\varvec{x}}}^{(K)}_i]\) has a column rank equal to d. In many cases, when d is large, the oversmoothing problem causes the high-hop vectors to become nearly identical; thus, the rank is typically smaller than d. We incorporate learnable multihop embeddings (PEs) \({\varvec{p}}^{(k)}\in {\mathbb {R}}^{d\times 1}\) to solve this problem.

Corollary 6

By incorporating learnable PEs \({\varvec{p}}^{(k)}\) and a learnable scaling factor w, the AGDN-HA layer can produce any d-dimensional prediction \({\varvec{Z}}\) given any d-dimensional graph signal \({\varvec{X}}\).

Proof

We denote the output by \({{\varvec{x}}}^{\prime }_{i} = w\overline{{\varvec{x}}}_i+\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}{\varvec{p}}^{(k)}\). To demonstrate this corollary, we consider a special case where w is zero, i.e., only the linear combination of PEs is considered. For each node i, the set of all possible outputs \({{\varvec{x}}}^{\prime }_{i}\) is exactly the convex hull P of \(\{{\varvec{p}}^{(0)},{\varvec{p}}^{(1)},\cdots ,{\varvec{p}}^{(K)}\}\). We can construct \({\varvec{p}}^{(k)}\) satisfying the condition of Proposition 5. For example, we can set \(K=d\) and let \(\{{\varvec{p}}^{(k)}\}\) consist of the d standard basis vectors and the single vector \([-1, -1,\cdots , -1]^{\top }\). Then P is d-dimensional and contains the origin (reached by setting \(\Theta ^{\text {HA}}_{i,k}=\frac{1}{d+1}\)); thus, there exists a d-dimensional ball centered at the origin that is a subset of P. We denote the radius of this ball by \(\epsilon\). Then, we enlarge this radius to \(w^{\prime }\epsilon\) by multiplying every \({\varvec{p}}^{(k)}\) by an additional scaling factor \(w^{\prime }>\frac{\sup _{i}||{\varvec{z}}_i||_2}{\epsilon }\). For all i, we always have \(\varvec{z}_i\in P\) since \(\sup _i||{\varvec{z}}_i||_2<w^{\prime }\epsilon\), i.e., for any target prediction \({\varvec{Z}}\) and every i, \(\exists \varvec{\Theta }^{\text {HA}}\in {\mathbb {R}}^{N\times (K+1)}\) such that \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}{\varvec{p}}^{(k)}={\varvec{z}}_{i}\). Hence, the proof is completed. \(\square\)

It is worth mentioning that the value of K in the AGDN-HA layer may be significantly lower than d, since the dimension of the smallest convex subspace that encompasses all target vectors can be small and complete universality may not always be needed. Additionally, in actual cases, the first term \(w\overline{{\varvec{x}}}_i\) involving the input features is not zero, and the second term \(\sum _{k=0}^{K}\Theta ^{\text {HA}}_{i,k}{\varvec{p}}^{(k)}\) offers complementary capability for approximating information that cannot be approximated by the first term alone. We also conduct an empirical analysis of the HA weights in the Appendix.

5.3 Stacking multilayer architecture

The stacking multilayer architecture with intermediate feature transformations and nonlinear activations is a critical aspect of AGDNs, as it improves model capacity and maintains compatibility with other important techniques. From a spectral perspective, the multilayer architecture is essential for leveraging variable filters. We assume that variable filters such as low-pass, high-pass, or other filters within the same model are crucial for capturing diverse information.

First, feature transformations between graph diffusion operators can enhance the quality of graph signals. By Theorem 3, arbitrary outputs can be produced provided that repeated eigenvalues and missing frequency components in the node features are avoided. Specifically, given a 1-dimensional graph signal \({\varvec{X}}=\sum _{i=1}^N \alpha _i {{\varvec{u}}}_i\) and a graph filter \(g(\lambda )\), the output becomes \(\sum _{i=1}^N g(\lambda _i) \alpha _i {{\varvec{u}}}_i\). Consequently, if \(\alpha _i=0\), then the output signal always misses the i-th component. Feature transformation, also known as channel mixing, can alleviate this issue by mixing graph signals from different channels.

However, if all graph signals have similar frequency components, linearly mixing them cannot introduce other frequency components. In such cases, nonlinear activations can be beneficial. For a graph signal \({\varvec{X}}\) and its spectral signal \(\widetilde{{\varvec{X}}}\), we apply a graph filter g with a nonlinear activation \(\sigma\): \({\varvec{U}} g(\varvec{\Lambda }) {{\varvec{U}}}^{\top }\sigma (\varvec{X})={\varvec{U}} g(\varvec{\Lambda }){\varvec{U}}^{\top }\sigma ({\varvec{U}} \widetilde{{\varvec{X}}})\). As explained in (Wang and Zhang 2022), we can define the activation function in the spectral domain as \(\sigma ^{\prime }(\widetilde{{\varvec{X}}})={\varvec{U}}^{\top } \sigma ({\varvec{U}} \widetilde{{\varvec{X}}})\). For each column of \(\widetilde{{\varvec{X}}}\), the frequency components from different rows are first linearly mixed with the eigenvector basis \({\varvec{U}}\), then elementwise transformed by \(\sigma\), and finally transformed back with \({\varvec{U}}^{\top }\). The nonlinear operation \(\sigma\) ensures that the linearly mixed components become nonlinearly mixed and cannot be recovered into the original separated components by left-multiplication with the transposed orthonormal eigenvector matrix \({{\varvec{U}}}^{\top }\), thereby allowing \(\sigma ^{\prime }\) to change the elements of \(\widetilde{{\varvec{X}}}\).
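A small NumPy illustration of this effect, under an assumed ring-graph setup: a signal with a single frequency component is passed through a spatial-domain ReLU, and its spectral representation afterwards contains several nonzero components.

```python
import numpy as np

N = 8
A = np.zeros((N, N))
for i in range(N):                          # ring graph: node i connected to i+1 (mod N)
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
d = A.sum(1)
L_norm = np.eye(N) - A / d[:, None]         # all degrees equal 2, so this is symmetric
lam, U = np.linalg.eigh(L_norm)

x = U[:, 3]                                 # signal with a single frequency component
print(np.round(U.T @ x, 3))                 # spectral coefficients: one nonzero entry
print(np.round(U.T @ np.maximum(x, 0), 3))  # after spatial ReLU: several nonzero entries
```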

In actual datasets, raw node features rarely have missing frequency components. However, after a GNN layer or graph filter, the output graph signal can easily become dominated by certain frequency components. For instance, a low-pass filter emphasizes the low-frequency components, so applying a high-pass filter to its output does not yield enough valuable information. With feature transformations and nonlinear activations in the stacking multilayer architecture, we can mitigate this problem.

5.4 Complexity analysis

The time complexity of an L-layer AGDN model is \(O(LKEd+LKNd+LNd^2)\), but it is simplified to \(O(LKEd+LNd^2)\) under the assumptions of \(N\ll E\) and \(K\ll d\). The adaptive diffusion mechanism in AGDN layers increases their time complexity compared to common GNN layers due to K-hop aggregations, elementwise multiplications, and optional HA computation. However, this mechanism significantly enhances the expressiveness of each AGDN layer. As a result, a compact AGDN model with fewer layers can match or surpass the performance of typical GNN models, where fewer layers counteract the increased complexity per layer. Furthermore, AGDNs are orthogonal to memory-saving techniques (Li et al. 2018; Zeng et al. 2020; Li et al. 2021) for GNNs, making them a promising approach for large-scale graph learning tasks.

Table 3 Complexity comparison

6 Experiments

In this section, we present the evaluation of AGDNs on various datasets, covering tasks of learning arbitrary filters, node classification, and link prediction. For the OGB datasets, we utilize the official data splits and metrics, which were carefully selected by the benchmark authors. For the heterophily-prone datasets, we randomly split the data into training, validation, and test sets according to fixed ratios in each run. We also conducted additional experiments with complementary metrics. Table 4 provides a comprehensive summary of these datasets. We conducted all experiments on a single Nvidia Tesla V100 with 16 GB of GPU memory. To address the memory constraints, we employed graph-based sampling techniques such as random partition (Li et al. 2020; Shi et al. 2021) for ogbn-proteins and ogbn-products and GraphSAINT (Zeng et al. 2020) for ogbl-citation2. For all datasets except ogbn-products and ogbl-citation2 (which were evaluated on CPU), we performed both training and inference of AGDNs on the same GPU card. For the last two tasks, we conducted 10 runs of each proposed model with random seeds 0–9 and report the means. In the comparison tables, we highlight the best results in bold. Unavailable results are indicated by –, as some baselines have not been implemented or reported for certain datasets. We list the critical hyperparameters in Table 5.

Table 4 Dataset statistics
Table 5 Hyperparameters of AGDNs. K = diffusion depth for each layer, L = number of layers, d = hidden dimension, heads = number of attention heads, \(\overline{\varvec{A}}\) = transition matrix, \(\varvec{\Theta }\) = weighting tensor style (HA or HC)

6.1 Task 1: learning filters

We conducted an experiment on 50 real images from the image processing toolbox in MATLAB with a resolution of \(100\times 100\), following the approach of (He et al. 2021). We treated each image as a 2D regular 4-neighborhood grid graph with an associated \(10000\times 10000\) adjacency matrix and a \(10000\times 1\) node feature vector. The task was to learn 5 different filters (low-pass, high-pass, bandpass, band-rejection, and comb) by applying models with graph signals as input. The formulas for the filters are shown in Table 6. The loss function used was the sum of squared errors between the output of the models and the filtered signal from the target filter, making this a simple regression task. We utilized seven baseline models: GCN (Kipf and Welling 2017), GAT (Velickovic et al. 2018), ARMA (Bianchi et al. 2021), ChebyNet (Defferrard et al. 2016), GPR-GNN (Chien et al. 2021), BernNet (He et al. 2021), and JacobiConv (Wang and Zhang 2022). The reported results of the baselines were taken from (He et al. 2021), except for JacobiConv, since its paper does not provide \(R^2\) scores; we re-evaluated JacobiConv and report its \(R^2\) scores. To implement the AGDN, we designed a 2-layer architecture with 8 hidden dimensions (4 heads) in the first layer and 8 hidden dimensions (8 heads) in the second layer, following the setting of the GAT in (He et al. 2021). We used a linear layer at the end of the model to output predicted values and an ELU as the intermediate nonlinear activation function. We scaled up the scores and removed LeakyReLU before the softmax to enhance the imbalance in the HA weights, which empirically improved performance. To ensure a fair comparison, we used the default hyperparameters from (He et al. 2021), except for the early stopping parameter: learning rate 0.01, 2000 epochs, and early stopping 300. We utilized PEs in the AGDN. Unlike (Wang and Zhang 2022), we did not use regularization techniques or tune the hyperparameters with automated tools.
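For reference, the following sketch shows how such a regression target can be constructed: build the 4-neighborhood grid graph, eigendecompose its normalized Laplacian, and apply a chosen filter \(g(\lambda)\) to a graph signal. The grid size and the example filter below are illustrative, not the exact definitions in Table 6, and the dense eigendecomposition is used only because the demo graph is small.

```python
import numpy as np
import scipy.sparse as sp

def grid_graph(h: int, w: int) -> sp.csr_matrix:
    """4-neighborhood grid graph used for the image-filtering task (sketch)."""
    idx = np.arange(h * w).reshape(h, w)
    rows = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    cols = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    A = sp.coo_matrix((np.ones(rows.size), (rows, cols)), shape=(h * w, h * w))
    return (A + A.T).tocsr()

def filtered_target(A: sp.csr_matrix, x: np.ndarray, g) -> np.ndarray:
    """Ground-truth signal U g(Λ) Uᵀ x for a chosen spectral filter g(λ)."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = np.eye(A.shape[0]) - (A.toarray() / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    lam, U = np.linalg.eigh(L)
    return U @ (g(lam) * (U.T @ x))

# Hypothetical usage with a 10×10 grid and an illustrative band-pass-like filter.
A = grid_graph(10, 10)
x = np.random.default_rng(0).random(100)
y = filtered_target(A, x, lambda lam: np.exp(-10.0 * (lam - 1.0) ** 2))
```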

Table 6 shows the average sum of squared errors and \(R^2\) scores. As expected from (Balcilar et al. 2021), GCN and GAT are capable of learning only low-pass filters. GPR-GNN, ARMA, and ChebyNet can learn several filters, but they are significantly outperformed by AGDNs, BernNet, and JacobiConv on all tasks. AGDN-HC outperforms GPR-GNN, which has a similar structure, on all tasks, demonstrating the effectiveness of the stacking multilayer architecture. BernNet and JacobiConv, which use more complicated bases and carefully tuned hyperparameters, also outperform AGDNs. The margin between AGDNs and BernNet or JacobiConv is small and acceptable considering that we did not design AGDNs with special polynomial bases. Incorporating Bernstein bases into AGDN, denoted as AGDN-HA* and AGDN-HC* in Table 6, leads to improved performance, outperforming BernNet on all filters except the high-pass filter and surpassing JacobiConv on the band-pass and comb filters. AGDN-HC outperforms AGDN-HA on most filters. This is reasonable since these tasks require the models to learn the same filter applied to all nodes, making node-specific HA weights difficult to optimize. Nonetheless, validating Corollary 6, AGDN-HA can still learn all filters and outperforms the other baselines except BernNet and JacobiConv. Furthermore, by tuning a few hyperparameters (\(L=3\), \(K=15\), \(d=16\), \(heads=8\)), the resulting AGDN-HC* (3-layer) significantly improves performance and outperforms all baselines on all filters, except for JacobiConv on the low-pass filter, where the gap is minimal. These results demonstrate that both AGDN-HA and AGDN-HC have the powerful ability to learn arbitrary filters in the spectral domain and can be further improved with certain polynomial bases.

Table 6 Task 1: Learning filters on 50 images

6.2 Task 2: semisupervised node classification

For semisupervised node classification, we evaluated AGDNs on three homophily-prone datasets (ogbn-arxiv, ogbn-proteins and ogbn-products) (Hu et al. 2020) and three heterophily-prone datasets (Chameleon, Squirrel and Actor) (Rozemberczki et al. 2021; Pei et al. 2020). We select the best model based on validation scores.

For the first three datasets, we utilize GCN (Kipf and Welling 2017), GraphSAGE (Hamilton et al. 2017), DeeperGCN (Li et al. 2020), GCNII (Chen et al. 2020), MAGNA (Wang et al. 2021), UniMP (Shi et al. 2021), LEGNN (Yu et al. 2022) and RevGNN (Li et al. 2021) as baselines. LEGNN is a recently proposed label-enhanced GNN. RevGNN has been shown to be a SOTA GNN and remains the main backbone of recent SOTA submissions (combined with other tricks) on the OGB leaderboard. We also compare against BernNet (He et al. 2021), utilizing its official implementation with the same hyperparameters used for the OGB baselines (3 layers with 256 hidden dimensions, learning rate 0.01, weight decay 0, and 1000 epochs). We adopt the official dataset split from (Hu et al. 2020). We incorporate several popular tricks for AGDN on ogbn-arxiv, including the bag of tricks for GNNs (BoT) (Wang 2021), self-knowledge distillation (Self-KD) (Zhang et al. 2019), and graph information-aided node feature extraction with XR-Transformers (GIANT-XRT) (Chien et al. 2022).

For the last three datasets, we utilize MLP, GCN (Kipf and Welling 2017), GAT (Velickovic et al. 2018), APPNP (Klicpera et al. 2019a), ChebyNet (Defferrard et al. 2016), GPR-GNN (Chien et al. 2021), JKNet (Xu et al. 2018), JacobiConv (Wang and Zhang 2022) and GOAL (Zheng et al. 2023) as baselines. We adopt the random dataset split, consistent across all classes, established in the GOAL paper (Zheng et al. 2023) and use the baseline results reported therein, except for JacobiConv (Wang and Zhang 2022), which was not evaluated in that study and originally uses a class-specific random split. We re-evaluated JacobiConv using its official implementation and the dataset split specified in GOAL to ensure consistency and comparability. In detail, we randomly split the node set of each dataset according to the ratio \(60\%\)/\(20\%\)/\(20\%\) into training, validation, and test sets, respectively. We randomly generate the split 10 times and run each experiment on each of the 10 splits.

Table 7 Task 2: Semisupervised node classification on the ogbn-arxiv
Table 8 Task 2: Semisupervised node classification on ogbn-proteins
Table 9 Task 2: Semisupervised node classification on ogbn-products

Comparison with classical and decoupled GNNs

AGDN, which offers more expressive graph diffusion than graph convolution, outperforms classical GNNs, such as GCN and GraphSAGE, in semisupervised node classification tasks in Tables 7, 8, and 9. Moreover, the stacking multilayer architecture of AGDN endows it with better expressiveness than that of the decoupled GNN, SIGN, on ogbn-arxiv and ogbn-products. Additionally, AGDN outperforms BernNet on ogbn-arxiv, where learning arbitrary filters may not be as crucial since node features are important and low-pass filters are adequate to capture critical information. The stacking multilayer architecture also facilitates the application of critical tricks, such as BoT, self-KD, and GIANT-XRT, on AGDN in ogbn-arxiv. Remarkably, the AGDN achieves better performance (73.51% test accuracy) than that of other classical and decoupled GNNs, even without these tricks.

Table 10 Task 2: Semisupervised node classification on heterophily-prone datasets

Comparison with spectral GNNs

As validated in Sect. 6.1, AGDNs demonstrate the ability to learn arbitrary filters, enabling strong performance on heterophily-prone datasets. Table 10 summarizes the average accuracy and standard deviation, showing that AGDN consistently outperforms all other spectral GNNs across the three datasets. Notably, AGDN surpasses the baselines by at least \(0.44\%\), highlighting its capability to learn non-low-pass filters effectively. These improvements stem from the adaptive graph diffusion mechanism and the stacking multilayer architecture, which captures diverse structural information and aligns the learned weighting tensor with the target filter characteristics, yielding superior performance.

Comparison with residual GNNs

AGDN demonstrates superior performance compared to that of complex residual GNNs, including GCNII, LEGNN, and SOTA RevGNNs, as shown in Tables 7, 8, and 9, thanks to its adaptive graph diffusion. Furthermore, in Table 11, AGDN consistently outperforms RevGNNs (RevGAT) of similar complexity when critical tricks (BoT, self-KD, and GIANT-XRT) are progressively applied. The AGDN achieves superior performance on all datasets with moderate complexity, leveraging multihop information more effectively with a moderate receptive field and significantly fewer layers than deep residual GNNs. For instance, on ogbn-proteins, a compact AGDN (6 layers) outperforms RevGNN-wide (448 layers) with only 13% of the parameter count, and on ogbn-products, a compact AGDN (4 layers) outperforms RevGNN-112 (112 layers) with just 52% of the parameter count.

Table 11 Comparison with SOTA RevGNN on ogbn-arxiv

6.3 Task 3: link prediction

For link prediction, we evaluate models on ogbl-ppa, ogbl-ddi, and ogbl-citation2. We utilize DeepWalk (Perozzi et al. 2014), Matrix Factorization (Menon and Elkan 2011) (MF), Common Neighbor (Liben-Nowell and Kleinberg 2007) (CN), Adamic Adar (Adamic and Adar 2003) (AA), Resource Allocation (Zhou et al. 2009) (RA), GCN (Kipf and Welling 2017), GraphSAGE (Hamilton et al. 2017), and PLNLP (Wang et al. 2021) as baselines. GNN methods are implemented with the encoder-decoder framework (Kipf and Welling 2016; Hu et al. 2020), where a GNN encoder outputs node representations, and a simple decoder or predictor makes predictions using pairs of node representations as input. We select the best model based on validation scores.
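As a concrete (hypothetical) sketch of this encoder-decoder setup, the predictor below scores node pairs from their representations using a Hadamard product followed by an MLP; this is one common choice of decoder, not necessarily the exact predictor used in our experiments.

```python
import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    """Simple decoder for encoder-decoder link prediction: scores an edge (i, j)
    from the pair of node representations (h_i, h_j)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h, edges):             # h: N × d, edges: 2 × E index tensor
        return self.mlp(h[edges[0]] * h[edges[1]]).squeeze(-1)  # one score per edge
```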

Table 12 Task 3: Link prediction on ogbl-ppa
Table 13 Task 3: Link prediction on ogbl-ddi
Table 14 Task 3: Link prediction on ogbl-citation2

As shown in Tables 12, 13, and 14, deterministic heuristic methods outperform encoder-decoder GNNs on ogbl-ppa but perform significantly worse on ogbl-ddi and ogbl-citation2. With adaptive graph diffusion, AGDN surpasses some heuristic methods (such as Common Neighbor and Adamic Adar) on ogbl-ppa.
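For reference, these heuristics score a candidate edge purely from the overlap of the endpoints' neighborhoods; the short sketch below computes Common Neighbor, Adamic Adar, and Resource Allocation scores on a toy graph (the graph and the node pair are illustrative only).

```python
import math
import networkx as nx

def heuristic_scores(G, u, v):
    """Common Neighbor (CN), Adamic Adar (AA), and Resource Allocation (RA)
    scores for a candidate edge (u, v)."""
    common = set(G[u]) & set(G[v])
    cn = len(common)
    aa = sum(1.0 / math.log(G.degree(w)) for w in common if G.degree(w) > 1)
    ra = sum(1.0 / G.degree(w) for w in common)
    return cn, aa, ra

# toy usage on a small built-in graph
G = nx.karate_club_graph()
print(heuristic_scores(G, 0, 33))
```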

AGDN significantly outperforms GCN and GraphSAGE on ogbl-ppa, ogbl-ddi, and ogbl-citation2 owing to its larger receptive field. We trained AGDN in a full-batch fashion, as with GCN and GraphSAGE, on ogbl-ppa and ogbl-ddi. On ogbl-citation2, however, we trained AGDN in a mini-batch fashion with GraphSAINT, which may negatively affect the final results; nonetheless, AGDN still outperforms full-batch GCN and GraphSAGE on this dataset. Moreover, AGDN outperforms PLNLP on ogbl-ddi and ogbl-citation2 and achieves comparable results on ogbl-ppa. Additionally, with learnable node embeddings, AGDN achieves 41.23% test and 43.32% validation Hits@100 on ogbl-ppa. These comparisons show that enlarging the receptive field of GNNs can improve link prediction performance and that the adaptive graph diffusion in AGDN can exploit valuable information in the deep neighborhood.
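A hedged sketch of the GraphSAINT-style mini-batch training loop is given below, using the PyTorch Geometric random-walk sampler; the toy graph, the sampler arguments, and the loop body are illustrative, and the exact loader signature may differ across PyG versions.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import GraphSAINTRandomWalkSampler

# tiny toy graph: a 4-node path with random features
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
data = Data(x=torch.randn(4, 8), edge_index=edge_index)

loader = GraphSAINTRandomWalkSampler(
    data, batch_size=2, walk_length=2,   # 2 root nodes, 2-hop random walks
    num_steps=3)                         # subgraphs sampled per epoch

for subgraph in loader:
    # each sampled subgraph is an ordinary Data object, so a model such as
    # AGDN can be trained on it exactly as in the full-batch setting
    print(subgraph.num_nodes, subgraph.num_edges)
```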

6.4 Runtime

Table 15 reports the training and inference runtimes of AGDN, GAT, and RevGNNs on the ogbn-arxiv, ogbn-proteins, and ogbn-products datasets. To account for the extra runtime introduced by generalized graph diffusion, we evaluate GATs with the same configuration as AGDNs, including the number of layers. On ogbn-proteins, the inference of AGDN is performed on the same GPU, whereas the inference of RevGNNs was performed on a different GPU, an Nvidia RTX A6000 (48 GB), and its runtime was not reported (Li et al. 2021). On ogbn-products, inference for all models is executed on the CPU due to memory constraints. As expected, AGDNs incur only reasonable additional runtime for both training and inference. Compared with GATs with the same receptive fields, AGDNs require less time for training and similar time for inference. We also compare AGDNs with the SOTA RevGNNs: RevGAT with only two layers requires less training time than AGDN on ogbn-arxiv, but with more layers, RevGNNs require more time than AGDNs on ogbn-proteins and ogbn-products. This suggests that AGDNs achieve a better balance between accuracy and efficiency.

Table 15 Runtime comparison on Tesla V100 cards

6.5 Parameter analysis

A parameter sensitivity analysis was performed on AGDN by using the ogbn-arxiv dataset. AGDN uses hyperparameters similar to those of other GNNs or neural networks, such as the learning rate, weight decay, and dropout. In this analysis, we focus primarily on the receptive field, a central aspect of this work. For AGDN, the receptive field (LK) can be expanded by increasing either the layer count (L) or the diffusion depth (K).
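As a toy illustration of the quantity being varied, the snippet below enumerates a few (L, K) combinations and the receptive field each one yields; a classical MPNN would need LK stacked layers to reach the same neighborhood.

```python
# Receptive field of an L-layer AGDN with diffusion depth K is L * K hops.
for L in (1, 2, 3):
    for K in (1, 2, 3):
        print(f"L={L}, K={K} -> receptive field = {L * K} hops "
              f"(equivalent MPNN depth = {L * K})")
```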

Table 16 presents the outcomes for various combinations of L and K. The optimal combination is \(L=3,K=2\). Some combinations exceeded the maximum GPU memory (16 GB) and are therefore omitted. The results reveal a general trend of improved performance when L, K, or both are increased. Expanding both L and K enlarges the receptive field more effectively than in typical GNNs, without introducing excessive complexity or triggering oversmoothing. The significant gap between the 1-layer cases and the others in Table 16 shows that at least a 2-layer architecture is required for sufficient expressiveness.

To balance efficiency and accuracy, a suitable range for both L and K is 2 to 3. Therefore, most experiments across the various tasks and datasets in this paper use combinations within this range.

Table 16 Parameter analysis for the number of layers and diffusion depth on ogbn-arxiv

6.6 Ablation study

Nodewise, channelwise

In this paper, we introduce nodewise and channelwise diffusion weights, associated with HA and HC, respectively, as a major contribution. To demonstrate their effectiveness, we conducted an ablation study on the learning filters task (Task 1). To remove the nodewise characteristic of HA, its weights were computed from the representation averaged over all nodes, so that every node shares the same weights; to remove the channelwise characteristic of HC, a single convolution kernel was shared across all channels. Table 17 shows that channelwise weights consistently improve AGDN-HC performance. Moreover, nodewise weights dramatically enhance the ability of AGDN-HA to learn all filters, especially non-low-pass filters, overcoming the limitation discussed in Subsect. 5.2.
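The sketch below gives a minimal, simplified view of how nodewise (HA) and channelwise (HC) weights can combine the multi-hop representations; the module name `AdaptiveDiffusion`, the concatenation-based attention score, and the tensor layout are assumptions for illustration, not the exact formulation of AGDN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDiffusion(nn.Module):
    """Combines the hop representations with learnable weights.
    mode='HA': nodewise hop attention (weights differ across nodes).
    mode='HC': channelwise kernel (weights differ across channels, shared by nodes)."""
    def __init__(self, K, dim, mode="HA"):
        super().__init__()
        self.mode = mode
        if mode == "HA":
            self.att = nn.Linear(2 * dim, 1)                  # scores hop k against hop 0
        else:
            self.kernel = nn.Parameter(torch.ones(K + 1, dim) / (K + 1))

    def forward(self, hops):                                  # hops: (K+1, N, dim)
        if self.mode == "HA":
            ref = hops[0].unsqueeze(0).expand_as(hops)        # 0-hop reference per node
            scores = self.att(torch.cat([hops, ref], dim=-1)) # (K+1, N, 1)
            w = F.softmax(scores, dim=0)                      # per-node weights over hops
            return (w * hops).sum(dim=0)
        # HC: unconstrained per-channel weights over hops, shared by all nodes
        return (self.kernel.unsqueeze(1) * hops).sum(dim=0)

# toy usage: K+1 hop representations for N nodes with dim channels
K, N, dim = 3, 5, 8
hops = torch.randn(K + 1, N, dim)
out_ha = AdaptiveDiffusion(K, dim, mode="HA")(hops)           # (N, dim)
out_hc = AdaptiveDiffusion(K, dim, mode="HC")(hops)           # (N, dim)
```

Under this sketch, the ablated variants correspond to replacing each node's representation with the node-averaged one before scoring (HA) and sharing one kernel value across all channels (HC).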

Stacking multilayer architecture

We investigated the impact of the stacking multilayer architecture by implementing AGDN-HA and AGDN-HC as decoupled models, each consisting of a single AGDN layer followed by linear layers. The only hyperparameter change was setting \(K=10\) to keep the receptive fields consistent. The results in Table 17 show that AGDN-HA and AGDN-HC suffer significant performance degradation when reduced to a single layer. Stacking multiple AGDN layers with variable filters allows target information to be captured more flexibly, as discussed in Subsect. 5.3.
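Reusing the hypothetical `AdaptiveDiffusion` module from the previous sketch, the two configurations compared in this ablation can be outlined roughly as follows; the propagation operator, layer counts, and final MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

def propagate(x, adj, K):
    """Collects the 0..K-hop representations under a (normalized) adjacency;
    stands in for the message-passing step inside one AGDN layer."""
    hops = [x]
    for _ in range(K):
        hops.append(adj @ hops[-1])
    return torch.stack(hops)                            # (K+1, N, dim)

class StackedAGDN(nn.Module):
    """L stacked layers, each combining K hops: receptive field L * K."""
    def __init__(self, dim, L=3, K=4):
        super().__init__()
        self.K = K
        self.diff = nn.ModuleList([AdaptiveDiffusion(K, dim, "HA") for _ in range(L)])
        self.lins = nn.ModuleList([nn.Linear(dim, dim) for _ in range(L)])

    def forward(self, x, adj):
        for diff, lin in zip(self.diff, self.lins):
            x = torch.relu(lin(diff(propagate(x, adj, self.K))))
        return x

class DecoupledAGDN(nn.Module):
    """A single diffusion step with a larger K, followed by linear layers only."""
    def __init__(self, dim, K=10):
        super().__init__()
        self.K = K
        self.diff = AdaptiveDiffusion(K, dim, "HA")
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, adj):
        return self.mlp(self.diff(propagate(x, adj, self.K)))
```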

Positional embeddings

The importance of PEs in AGDN-HA was discussed in Corollary 6. These embeddings ensure that AGDN-HA can theoretically produce any value, which is particularly important for learning non-low-pass filters. To investigate this, we conducted an ablation study on the learning filters task. The results in Table 17 show that AGDN-HA performs significantly worse without PEs, especially on non-low-pass filters. Since AGDN-HC is already theoretically capable of learning arbitrary filters, PEs appear redundant for it; as a result, AGDN-HC performs slightly better without PEs, except for the bandpass and band-rejection filters.
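A minimal way to attach hop positional embeddings is sketched below: a learnable vector per hop index is added to the corresponding hop representations before the hop-attention scores are computed. The class name, embedding shape, and initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HopPositionalEmbedding(nn.Module):
    """One learnable embedding per hop index, added to that hop's representations
    so that hops with similar content remain distinguishable to the attention."""
    def __init__(self, K, dim):
        super().__init__()
        self.pe = nn.Parameter(torch.empty(K + 1, 1, dim))
        nn.init.normal_(self.pe, std=0.02)

    def forward(self, hops):                 # hops: (K+1, N, dim)
        return hops + self.pe                # broadcast over the node dimension

# toy usage: hop representations that are identical across hops become
# distinguishable once the positional embeddings are added
K, N, dim = 3, 5, 8
hops = torch.randn(1, N, dim).expand(K + 1, N, dim)
with_pe = HopPositionalEmbedding(K, dim)(hops)   # (K+1, N, dim)
```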

Table 17 Ablation study for nodewise weights, channelwise weights, and the stacking multilayer architecture on 50 images

AGDN versus GAT

In Table 18, we present an ablation study comparing the performance of GAT and AGDN on six OGB datasets. For ogbn-arxiv, we disabled additional techniques such as BoT, self-KD, and GIANT-XRT while keeping the settings consistent for both models. AGDN outperformed GAT on all datasets, highlighting the effectiveness of its adaptive graph diffusion.

Table 18 Ablation study on six OGB datasets

Oversmoothing

We compared AGDN variants built on various base models (GCN, GraphSAGE, GAT) with diffusion depths \(K=1,2,\ldots,8\) on the ogbn-arxiv dataset by using the official OGB implementation. As baselines, we used classical GNNs (MPNNs) with equivalent receptive fields; for example, the MPNN baseline associated with a 3-layer AGDN with \(K=4\) has \(3\times 4=12\) layers. Figure 4 shows that the MPNN baseline curves, especially for GAT, exhibit a distinct oversmoothing problem. AGDN-mean shows a curve that rises quickly but then drops significantly, indicating that it is also affected by oversmoothing. In contrast, AGDN-HA and AGDN-HC show much more stable curves. AGDN-HA attains optimal results similar to those of AGDN-mean, whereas AGDN-HC cannot outperform the shallow baseline models; because AGDN-HC directly learns unconstrained weights, it may be difficult to optimize on ogbn-arxiv. These results demonstrate that adaptive graph diffusion (HA or HC) effectively alleviates the oversmoothing problem.

Fig. 4
figure 4

Comparison among AGDN-mean, AGDN-HA, AGDN-HC, and the related base models with varying diffusion depths on ogbn-arxiv. For the MPNN baselines, the “Receptive field” is the number of layers (L). For AGDN, the “Receptive field” is the product of the number of layers and the diffusion depth (LK)

7 Discussion

In this work, we introduced adaptive graph diffusion networks (AGDNs) to tackle challenges posed by deep GNNs in numerous applications. The approach adopted by AGDNs involves learning nodewise (HA) and channelwise (HC) graph diffusion with a stacked multilayer architecture. This capability empowers them to effectively learn arbitrary filters in the spectral domain, thereby discerning intricate patterns within graph data.

Through this methodology, AGDNs show improved performance on tasks such as node classification and link prediction. This positions AGDNs as an alternative that strikes a better balance between complexity and performance than complex residual GNNs and oversimplified decoupled GNNs.

However, despite considerable progress, certain areas warrant further research. The spectral characteristics of AGDN-HA, for instance, need additional exploration. Moreover, it is essential to assess the performance of AGDNs in more real-world applications to fully understand their practical implications and limitations.

8 Conclusion

In this paper, we present AGDNs to address the challenges faced by deep GNNs in various applications. AGDNs leverage learnable nodewise (HA) and channelwise (HC) graph diffusion together with a stacking multilayer architecture, enabling them to learn arbitrary filters in the spectral domain and to capture complex patterns and relationships in graph data. This approach leads to improved performance on node classification, on both homophily-prone and heterophily-prone datasets, as well as on link prediction. AGDNs offer a better balance between complexity and accuracy than complicated residual GNNs and oversimplified decoupled GNNs, presenting a promising solution for improving GNN performance on a range of graph-based applications. The spectral characteristics of AGDN-HA remain an area for future research, and the efficiency of the hopwise attention mechanism can be further optimized. Future work will focus on enhancing the spectral interpretability and efficiency of AGDN-HA, as well as exploring more effective positional embeddings.