research-article

Public Access

Open-World Graph Active Learning for Node Classification

Authors:

Hui Xu,

Chenghu ZhouAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 2

Article No.: 43, Pages 1 - 20

https://doi.org/10.1145/3607144

Published: 14 November 2023 Publication History

PDF eReader

Abstract

The great power of Graph Neural Networks (GNNs) relies on a large number of labeled training data, but obtaining the labels can be costly in many cases. Graph Active Learning (GAL) is proposed to reduce such annotation costs, but the existing methods mainly focus on improving labeling efficiency with fixed classes, and are limited to handle the emergence of novel classes. We term the problem as Open-World Graph Active Learning (OWGAL) and propose a framework of the same name. The key is to recognize novel-class as well as informative nodes in a unified framework. Instead of a fully connected neural network classifier, OWGAL employs prototype learning and label propagation to assign high uncertainty scores to the targeted nodes in the representation and topology space, respectively. Weighted sampling further suppresses the impact of unimportant classes by weighing both the node and class importance. Experimental results on four large-scale datasets demonstrate that our framework achieves a substantial improvement of 5.97% to 16.57% on Macro-F1 over state-of-the-art methods.

1 Introduction

Graph Neural Networks (GNNs), achieving a significant success in recent years, lie at the heart of many real-world applications [6, 9, 10, 18, 19, 28]. Despite its great power, GNNs rely on a large number of high-quality annotations to obtain good performance [1, 7, 25], necessitating a labeling process, which could be time-consuming or costly in some domains. Graph Active Learning (GAL) [1] aims at reducing such labeling costs by dynamically querying the labels of the most informative nodes from the unlabeled ones on graph during GNN training. However, existing GAL methods mainly focus on a “closed-world” setting, where all nodes belong to a fixed known group of classes. As a result, the importance of the nodes in new classes may not be correctly evaluated in the labeling nodes selection.

Open-world learning [20] is a paradigm where new classes/trends constantly emerge and models have to recognize nodes of seen and unseen classes. In the open-world scenario, a graph develops over time as new nodes, edges, and classes are introduced, and thus GAL is required to make a fair selection among all nodes for labeling, despite its class has been seen or not. Ideally, nodes of new classes are selected to be annotated so that GNNs can better generalize to the evolving graph. Hence open-world graph active learning (OWGAL) should not only select the most informative nodes, but uncover nodes of latent new classes, as depicted in Figure 1.

Fig. 1.

Fig. 2.

Unfortunately, existing GAL methods are incapable of extending to the open-world case. The recent proposed GAL approaches [1, 4, 7, 24, 25, 26] calculate importance metrics over the GNN classifier and select nodes based on the metrics. Typical metrics include uncertainty [1], representativeness [4], and information gain [7], all of which are based on the node representations. To capture the topology information, some works [25, 26] take the influence between the nodes to be labeled and their neighboring nodes into account to incorporate the local structural information. However, these mainstream metrics pay little attention to nodes of unknown classes, which are often missed in the top-rated nodes by the current GAL methods. As we verify in Figure 2 which gives the scores of the top-10 unlabeled nodes of each class in the first round of active learning, the conventional metrics tend to assign lower scores to nodes of unknown classes, posing great challenges to select nodes to annotate merely by metrics.

Fig. 3.

As we dived deeper, we made two further observations with respect to nodes of unknown classes. First, we found that unknown-class nodes typically have a larger distance to the labeled nodes on graph. For unlabeled node \(v_i\), we set its distance to the set of labeled nodes as \(d_{i}=\min _{j\in \mathcal {V}^L}sp(v_i, v_j)\), where \(\mathcal {V}^L\) is the labeling set and \(sp\) is the length of the shortest path between \(v_i\) and \(v_j\). The class average of \(d_i\) is depicted by the left part of Figure 3. And it is obvious that nodes of unknown classes have larger distances to all labeled nodes, consistent with the homophily assumption that a node is more likely to link with nodes of the same class. Second, we discovered that node representations of unknown classes are significantly intertwined with those of known classes. As shown in the right part of Figure 3, novel-class nodes are more likely to be away from any known class cluster in the embedding space and are clustered in the central part of the figure. Due to the limited representation ability of the model, novel-class nodes significantly overlap with known-class nodes, rendering nodes of unknown classes almost indistinguishable.

Hence, lessons are learned that: (1) it is important to adjust the geometric distance in the representation space to clearly separate the known-class nodes from unknown-class ones; (2) unknown class nodes typically have a larger distance to labeled ones on graph due to the homophily property. Therefore, we carve out representation space for the unknown-class nodes by making the representations from known classes more compact, so that unlabeled nodes are distant from any labeled nodes in the representation space, with higher uncertainty scores. Additionally, any node with a larger distance to labeled nodes may suggest a novel class.

Based on the lessons, we propose a new OWGAL framework (referred by the same name) in this article. To make labeled nodes more compact, we introduce geometric-based prototype learning [3], where each known class is represented by a high-level prototype vector. We cluster labeled nodes and push known classes away from each other to enhance the uncertainty score of novel-class nodes. Meanwhile, Label Propagation (LP) is introduced to capture the nodes away from the labeled ones in topology space. The two techniques are not only effective in discovering novel-class nodes, but also boundary nodes. Prioritizing those nodes to label is beneficial to improving the overall performance. Moreover, some novel classes have a small number of nodes, and spending too much labeling cost on these classes will lead to poor overall performance due to the limited budget. However, these classes are typically challenging to categorize accurately, producing a very high uncertainty score. If we only label nodes with the highest uncertainty score, the OWGAL model will favor these classes, preventing the discovery of other novel classes. To avoid over-preferring these less important classes, we propose to leverage weighted sampling as opposed to choosing top-rated nodes.

Highlights of our contributions are as follows:

—

To the best of our knowledge, our work is among the first studying the OWGAL problems.

—

We introduce a framework featured by prototype learning, unparameterized LP and weighted sampling, improving GNN’s capability in selecting key nodes to label, especially those of unseen classes. Our labeling strategy could also distinguish important classes from unimportant ones, enhancing the GNN performance overall.

—

We conduct extensive experiments on four real-world large-scale graphs to demonstrate the superior performance of our method. OWGAL improves the Macro-F1 by 5.97% to 16.57% over the state-of-the-art baselines.

2 Related Work

2.1 Open-World Learning

Open-world learning is also known as the open-set learning/recognition, which identifies classes the learner has seen before as well as detects new classes has never seen before. The problem has attracted much attention in Natural Language Processing [21, 22] and Computer Vision [2, 5, 11, 13, 17, 29]. Proser [29] uses a dummy classifier to generate dynamic thresholds and employs mix-up to generate virtual samples of novel classes. OW-DETR [5] models the probability distribution of unknown classes by adding an extra dimension to the output, and selects the most probable candidate queries through an attention-based pseudo-labeling mechanism. Compared to these methods, we do not identify all novel classes at once, but gradually find potential novel classes during the labeling process.

Among the first efforts to resolve the problem on graph, OpenWGL [20] proposes an uncertain node representation learning approach based on variational graph autoencoder, which is good at recognizing nodes within unseen classes. The Out-of-Distribution (OOD) detection is also highly related to OWL, as the nodes of unknown classes can be viewed as the OOD data. GKDE [27] presents a network for uncertainty-aware estimation to predict the Dirichlet distribution of nodes and to detect the OOD data. GPN [16] utilizes the graph posterior network to predict various types of uncertainty associated with each node. However, none of these methods take annotation into account; nodes of novel classes are simply classified as “unknown” and rejected in testing.

2.2 Graph Active Learning

The goal of GAL is to select the most valuable nodes to label to reduce labeling efforts. A series of strategies have been proposed: AGE [1] and ANRMAB [4] exploit heuristic metrics to measure the value of each node, including uncertainty, information density, and graph centrality. Further, the reception field of each labeling node is also considered [24]. However, these metrics draw on the property of the node, neglecting relation between labeling nodes and their neighbors, embodied in the message passing of GNNs. To overcome the drawback, GRAIN [26] and IGP [25] utilize the feature influence score across nodes to select the labeling nodes via diversified influence maximization and information gain propagation maximization, respectively. Without specific query strategy, GPA [7] adopts reinforcement learning to discover the optimal combination of various strategies. However, current GAL approaches share the drawback of being insensitive to novel classes and hence not effective in OWGAL.

3 Preliminaries

We first define the problem of the open-world GAL, and then introduce a brief background on GNNs and prototype representation. We denote the graph by \(\mathcal {G}=(\mathcal {V}, \mathcal {E}, X)\), where \(\mathcal {V}=\lbrace v_1, v_2, \ldots , v_n\rbrace\) is the set of nodes, \(\mathcal {E}\) is the set of edges and \(X=\lbrace \boldsymbol {x}_1, \boldsymbol {x}_2, \ldots , \boldsymbol {x}_n\rbrace \in \mathbb {R}^{n \times d}\) denotes the node feature matrix with dimension \(d\). For node classification, the prediction targets are \(Y=\lbrace \boldsymbol {y}_1, \boldsymbol {y}_2, \ldots , \boldsymbol {y}_n \rbrace \in \mathbb {R}^{n \times k}\), where \(\boldsymbol {y}\) is a \(k\)-dimensional one-hot vector. The group of all classes is denoted as \(C=\lbrace c_i|i=1,\ldots ,k\rbrace\). The set of classes that not have been observed is defined as the “unknown” class \(C_{u} \subseteq C\), whereas the remaining classes are collectively denoted as “known” classes \(C_{n}=C \backslash C_u\).

3.1 Problem Definition

At a given stage of graph evolution, we conduct a new round of active learning. Given current graph \(\mathcal {G}=(\mathcal {V}, \mathcal {E}, X)\) with \(C_n\), initial training set \(\mathcal {V}_{train}\) on \(C_n\), a labeling budget \(\mathcal {B}\), and a loss function \(l\), the goal of OWGAL is to select a subset of nodes \(\mathcal {V}_l \subset \mathcal {V} \backslash \mathcal {V}_{train}\) that produces a model \(f\) with the lowest loss on the rest nodes \(\mathcal {V}_{test}\):

\begin{equation} \mathop {\arg \min }_{\mathcal {V}_l:|\mathcal {V}_l|=\mathcal {B}} \mathbb {E}_{v_i\in \mathcal {V}_{test}}[l(\boldsymbol {y}_i, \tilde{\boldsymbol {y}}_i)] \end{equation}

(1)

where \(\tilde{\boldsymbol {y}}_i\) is the label prediction with \(f\) of node \(v_i\).

Fig. 4.

3.2 Prototype Representation

The initial proposal for prototype representation appears in Ref. [15], where one class is represented by the mean embedding of its support nodes. The embeddings of nodes of a particular class tend to cluster around the representation of the class prototype in metric space. The training data used to calculate the prototype representation is denoted as the support set \(S\), while the remaining training data is referred to as the query set \(Q\).

4 Method

We first give an overview of our OWGAL framework in Figure 4. A salient feature of OWGAL is that, instead of using a GNN classifier, OWGAL “classify” nodes by seeking the closest prototype representation and claims the node belongs to that class. Meanwhile, unparameterized LP helps to expose novel classes in the topology space. Done learning, OWGAL computes the importance score of each unlabeled node, on which weighted sampling is further performed to finally select the labeling nodes. The above selection and labeling process are repeated until the labeling budget is consumed.

4.1 Metric-Based Prototype Learning

In open-world scenario, novel node classes emergence as the graph evolves. Since unknown classes have no labeled data at the out-set, the nodes distribution overlaps with that of the known classes, making identification difficult. To make known and unknown classes more identifiable, we learn more compact representations for nodes in the same class, and push known classes as far away from each other as possible and letting their projections to distribute evenly on a unit sphere. Due to the minor but crucial feature discrepancy, unknown class nodes will be harder to include in any cluster of known classes, increasing the uncertainty of unknown class node predictions. To achieve this, instead of GNN classifier, OWGAL propose to learn the prototype representation from geometric relationships.

Specifically, we first employ the graph neural network \(f_{\mathcal {G}}\) to get node representation \(\boldsymbol {z}_i=f_{\mathcal {G}}(\boldsymbol {x}_i)\) for each node \(v_i\). We randomly sample a small set of training nodes for each known class to compose the support set \(S\), and the prototype representation of each known class is

\begin{equation} \boldsymbol {p}_i=\sum _{v_j \in S_i}\frac{\mathrm{exp}(\alpha _j)}{\sum _{v_k \in S_i}\mathrm{exp}(\alpha _k)}\boldsymbol {z_j}, \end{equation}

(2)

where \(S_i\) is the support set of known class \(c_i\), and \(\alpha _j\) is the pagerank score of node \(v_j\). Since each node has different impact to the prototype, we use the topology centrality to evaluate the importance of support node \(v_j\) to prototype \(\boldsymbol {p}_i\). The prototype representation \(\boldsymbol {p}_i\) can be viewed as the cluster center of the class \(c_i\) in representation space, and each node of \(c_i\) should be around \(\boldsymbol {p}_i\) closely. Thus, to classify the node \(v_j\), a probability is assigned for each class \(c_i\) based on the distance between node \(v_j\) and prototype representation \(\boldsymbol {p}_i\):

\begin{equation} p(c_i|v_j) = \frac{\mathrm{exp}(-d(\boldsymbol {z_j}, \boldsymbol {p}_i))}{\sum _{c_k \in C_{n}} \mathrm{exp}(-d(\boldsymbol {z_j}, \boldsymbol {p}_k))}, \end{equation}

(3)

where \(d(\cdot , \cdot)\) is the Euclidean distance, \(C_n\) is the group of known classes. To let nodes of the same class have compact representations and the model can “classify” nodes by assigning each node its nearest prototype, we minimize the intra-class loss

\begin{equation} \mathcal {L}_p = \frac{1}{|C_n||Q_i|}\sum _{i=1}^{|C_{n}|}\sum _{v_j\in Q_i}-\log p(c_i|v_j), \end{equation}

(4)

where \(p(c_i|v_j)\) is the prediction probability of node \(v_i\) on ground-truth \(c_i\), and \(Q_i\) is the query set of known class \(c_i\), i.e., remaining training nodes.

To enhance the distinguishability of the unknown-class nodes from the known-class nodes, OWGAL employs two different metrics simultaneously, i.e., Euclidean distance and Cosine distance. Given a prototype representation \(\boldsymbol {p}_i\) of class \(c_i\), we first maximize the Euclidean distance from \(\boldsymbol {p}_i\) to its closest prototype by minimizing the loss:

\begin{equation} \mathcal {L}_{e} = \frac{1}{|C_n|}\sum _{c_i\in C_n} \max \limits _{c_j\in C_n\backslash c_i}\mathrm{exp}(-d(\boldsymbol {p}_i, \boldsymbol {p}_j)), \end{equation}

(5)

where \(d(\cdot)\) is the Euclidean distance. Then, we maximize the cosine distance between \(\boldsymbol {p}_i\) and its closet prototype by minimizing the loss:

\begin{equation} \mathcal {L}_{c} = \frac{1}{|C_n|}\sum _{c_i\in C_n} \max \limits _{c_j\in C_n\backslash c_i}\left(\frac{\boldsymbol {p}_i-\boldsymbol {p}_c}{||\boldsymbol {p}_i-\boldsymbol {p}_c||}\cdot \frac{\boldsymbol {p}_j-\boldsymbol {p}_c}{||\boldsymbol {p}_j-\boldsymbol {p}_c||}\right) + 1, \end{equation}

(6)

where \(\boldsymbol {p}_c=\frac{1}{|C_n|}\sum _{c_i\in C_n}\boldsymbol {p}_i\) is the selected geometric center, and the constant 1 ensures the loss is non-negative. By increasing the distances between known-class nodes, we let the known classes’ prototypes spread out as much as possible, and pushing them away from each other while leaving the unknown-class nodes apart. Thus, the unknown-class nodes are more distinguishable from the known-class ones. To sum up, the total loss is defined as

\begin{equation} \mathcal {L}= \mathcal {L}_{p} + \lambda (\mathcal {L}_{e}+\mathcal {L}_{c}). \end{equation}

(7)

And \(\lambda\) is a hyperparameter to control the balance between intra-class and inter-class learning. The detailed procedure is given in Algorithm 1.

4.2 Label Propagation

Except for the prediction uncertainty in the metric space, the graph topology and label information are also significant to discover the latent novel classes. Motivated by the homophily assumption that adjacent nodes tend to have the same label, we exploit the intuition that node \(v\) is more likely to belong to a novel class if no labeled node reach it within \(K\) steps of random walks on graph. Hence, we propagate the known labels along edges on the graph to unlabeled nodes [30]. Specifically, we propagate labels iteratively by

\begin{equation} \hat{\boldsymbol {y}}^{(k+1)}_v = \frac{1}{|\mathcal {N}_v|+1}\sum _{u\in \mathcal {N}_v\cup \lbrace v\rbrace }\hat{\boldsymbol {y}}^{(k)}_u, \end{equation}

(8)

where \(\hat{\boldsymbol {y}}_v^{(k)}\) is the soft label of node \(v\) in the \(k\)th iteration, and \(\mathcal {N}_v\) is the neighbor set of node \(v\). For initialization, we set a uniform label distribution for each unlabeled node and a one-hot label vector for the labeled node:

\begin{equation} \hat{\boldsymbol {y}}_v^{(0)}=\left\lbrace \begin{aligned}(0,\ldots ,1,...0)& , & \forall v\in \mathcal {V}_t, \\ \Big (\frac{1}{|C_n|},\ldots ,\frac{1}{|C_n|},...&\frac{1}{|C_n|}\Big) , & \forall v\notin \mathcal {V}_t \end{aligned} \right. \end{equation}

(9)

where \(\mathcal {V}_t\) is the labeled node set, and we denote the final propagation output of node \(v\) as \(\hat{\boldsymbol {y}}_v=\mathrm{softmax}(\hat{\boldsymbol {y}}_v^{(K)})\). The unlabeled nodes at least \(K\) steps from any labeled nodes would not be affected by labeled nodes, and thus have the highest uncertainty score with a uniform distribution.

4.3 Node Selection and Model Training

At each labeling round, we need to compute the importance score of each node and select a batch of nodes to annotate. Given the soft label \(\bar{\boldsymbol {y}}_i\) of node \(v_i\) produced in prototype learning where \(\bar{\boldsymbol {y}}_{i,j}=p(c_j|v_i)\), and the LP result \(\hat{\boldsymbol {y}}_i\), we calculate the importance score of node \(v_i\) by

\begin{equation} s_{n}(v_i) = \gamma \cdot H(\bar{\boldsymbol {y}}_i) + (1-\gamma) \cdot H(\hat{\boldsymbol {y}}_i), \end{equation}

(10)

where \(H\) is the entropy and \(H(\boldsymbol {a})=-\sum _{i=1}^{k}a_{i}\mathrm{log}a_{i}\). Through Equation (10), OWGAL unifies the discovery of unknown classes and important nodes, as it weighs unseen-class nodes as much as the nodes near the decision boundary, both of which have high uncertainty scores in our framework.

It is straightforward to select the top-rated nodes given the importance score by Equation (10). However, the selection depending on pure scores would favor abnormal classes with few nodes, which we consider “unimportant”, resulting in budget waste. Hence we propose to consider the class scores besides each node score, and the score is defined as \(s_c(c_i)=\sum _{v_i\in \mathcal {V}_{i}}s_n(v_i)\) for class \(c_i\), where \(\mathcal {V}_{i}\) is the node set of class \(c_i\). We expect nodes in the class with a higher \(s_c\) are more likely to be selected, which helps reducing the bias toward “unimportant” classes. As \(s_c\) is only an estimation, we turn the selection process into a randomized one by weighted sampling. The weight of \(v_i\) is

\begin{equation} w(v_i) = \frac{\mathrm{exp}(s_n(v_i)/\tau)}{\sum _{v_k\in \mathcal {V}}\mathrm{exp}(s_n(v_k)/\tau)}, \end{equation}

(11)

where \(\tau \le 1\) is the temperature parameter. Therefore, the probability of selecting class \(c_i\) is \(p(c_i)=\sum _{v_i\in \mathcal {V}_{i}}w(v_i) \propto s_c(c_i)\). Furthermore, we can adjust the temperature parameter \(\tau\) to control the smoothness of weight distribution.

After each round of labeling, the GNN model is retrained by minimizing the loss Equation (7). We randomly choose some training nodes for each class \(c_i\) as support set \(S_i\) and other training nodes compose the query set \(Q_i\). The prototype representation of each class is computed by training nodes with a similar size. Algorithm 2 illustrates the pseudo code of the entire training procedure.

4.4 Complexity Analysis

We analyze the complexity of our approach by modules: the classification model and the node importance measure in each round. (1) Instead of GNN classifier, prototype learning serves as the base classifier for OWGAL. The time complexity of prototype learning is \(\mathcal {O}(LNd^2+LEd+kE)\), where \(L\) means the number of GNN layers, \(N\) denotes the number of nodes, \(d\) is the dimension of input feature, \(E\) is the number of edges and \(k\) is the number of iterations for pagerank. Compared to the \(\mathcal {O}(LNd^2+LEd)\) time complexity of GCN, prototype learning almost does not incur additional overhead. Meanwhile, prototype learning has the same \(\mathcal {O}(LNd+Ld^2)\) space complexity to GCN; (2) The node importance measure in each round costs \(\mathcal {O}(KEC+NC)\) time complexity, with \(C\) the number of classes and \(K\) the depth of LP. Such importance measure is more efficient than the most lightweighted baseline—AGE with time complexity \(\mathcal {O}(NC+mNCd)\), where \(m\) is the number of iterations in the clustering process. The space complexity of OWGAL is \(\mathcal {O}(Nd+KNC+E)\), which is slightly larger than AGE with \(\mathcal {O}(Nd+NC+E)\). Other baselines such as IGP have a time complexity of \(\mathcal {O}(NLd^2)\) and a space complexity of \(\mathcal {O}(Nd+Ld^2)\), much larger than OWGAL.

5 Experiment

5.1 Setup

5.1.1 Datasets.

To mimic the real-world applications, we evaluate OWGAL on six datasets: Amazon-Computer, Amazon-Photo, Coauthor-CS, Coauthor-Physics, Arxiv, Reddit, and Products. We summarize the statistics of these datasets in Table 1.

Table 1.

Dataset	#Nodes	#Edges	#Features	#Classes	#K	#Setting
A-Photo	7,650	238,162	745	8	3	Transductive
A-Computer	13,381	245,778	767	10	3	Transductive
Coauthor-CS	18,333	163,788	6,805	15	3	Transductive
Coauthor-Physics	34,493	495,924	8,415	5	2	Transductive
Arxiv	169,343	1,166,243	128	40	10	Transductive
Reddit	232,965	11,606,919	602	41	16	Inductive
Products	2,449,029	61,859,140	100	47	12	Inductive

Table 1. Statistics of Datasets Used in the Experiments

—

Coauthor-CS and Coauthor-Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Nodes represent authors who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers; and class labels represent each author’s most active fields of study.

—

Amazon-Photo and Amazon-Computers [14] are extracted from Amazon co-purchase graph. Nodes represent the goods and edges represent that two goods are frequently bought together. Node features are product reviews encoded by the bag-of-words, and each node is assigned with a predefined product category.

—

Arxiv [8] is also a citation network between all Computer Science (CS) arXiv papers indexed by MAG. Each node is an arXiv paper and each directed edge indicates that one paper cites another one. Besides the 128-dimensional feature vectors, the published year of paper is also provided. The label is the subject areas of the paper.

—

Reddit [23] is an online network. The nodes are the posts on Reddit where two posts are connected if the same user comments on both. The label is the community that a post belongs to. And off-the-shelf 300-dimensional word vectors are used as the node features.

—

Products [8] is the Amazon product co-purchasing network. Nodes represent products sold in Amazon, and edges between two products indicate that the products are purchased together. Node features are the bag-of-words extracted from the product descriptions, and then followed by a Principal Component Analysis.

5.1.2 Baselines.

We compare OWGAL with the following GAL baselines: Random, AGE[1], ANRMAB[4], GPA[7], ALG[24], IGP-hard[25], AGE+OpenWGL, AGE+GPN. To adapt to the open-world setting, while new class nodes are discovered, the output dimension of GNN classifier in each baseline are expanded to the number of currently known classes. For Random, we use the GNN classifier as backbone and annotate the nodes in each round randomly. For AGE, ANRMAB and ALG, we follow the official implementation. To scale ALG to the large-scale graph, we perform the same efficiency optimization as in its article. For GPA, we choose the model pre-trained on Reddit for all datasets. The native implementation of IGP does not identify novel classes as its oracle only returns a binary answer, so we implement IGP-hard by replacing the binary answer with the ground-truth label. In addition, as no prior works deal with the open-world setting, we implement two variants: AGE+OpenWGL and AGE+GPN. OpenWGL and GPN are two open-world learning algorithms on graphs designed to distinguish nodes from unknown classes. They can calculate a score for each node, with the score increasing if the model believes the node is more likely to belong to the unknown class. We combine these two methods with AGE by selecting labeling nodes as those with the topmost summed scores. Scores evaluated by the two parts are normalized.

5.1.3 Experiment Setting.

We use Micro-F1 and Macro-F1 as the evaluation metrics. For each dataset, we split all classes into known and unknown classes according to the number of nodes contained. Mimicking the real-world scenario, we assign \(K\) classes with the most nodes as known classes, as newly emerged classes usually have fewer nodes in total. The choice of \(K\) is given in Table 1.

We evaluate both the transductive and inductive settings. In the former, all nodes on the graph are observed in training. The nodes labeled by GAL algorithms are training/validation nodes, and the remaining unlabeled ones are all denoted as testing nodes. Note that we pre-defines no validation or test set in transductive setting, making the AL case more challenging but realistic. For inductive setting, we use the conventional training/validation/test split, but do not use the validation set as it contains vast labeled nodes. Nodes from the training set are selected for labeling and a part of the labeled nodes are used to adjust hyperparameters in training.

On each dataset, we randomly set 20 initial training nodes for each known class as the past labeled nodes. GCN serves as the backbone for the transductive learning and SAGE as the basis for inductive learning. Since the prediction on Arxiv and Products are more difficult, we adopt a three-layer GNN on them and a two-layer GNN for the rest datasets. The hidden dimension of each layer is set to 128 for Amazon-Computer, and 256 for the rest. Hyperparameters of OWGAL are set as \(\lambda =0.1\) and \(\tau =0.1\), respectively. Adam optimizer is employed to train our model and the learning rate is searched from \(\lbrace 0.001, 0.005, 0.01\rbrace\). All models are implemented on a server with two CPUs (Intel Xeon E5-2630 \(\times\) 2) and four GPUs (NVIDIA GTX 2080 \(\times\) 4).

5.2 Performance Comparison

Following prior works [7, 24, 25], we evaluate the performance of GAL from two aspects: (1) the overall performance when all active learning budgets are consumed; (2) the performance gain of the newly labeled nodes in each active learning round.

Table 2.

5.2.1 End-to-End Comparison.

To evaluate the overall performance of all GAL methods, we choose 20 labeling budget per unknown class following prior works and report the results after all labeling budget consumed. To mitigate results randomness, we run OWGAL and baselines five times with different random seeds. As shown in Table 2, we have the following observations: (1) Random sampling can beat most of the GAL baselines on both Micro-F1 and Macro-F1 scores. This is because random sampling selects nodes to label from all classes according to data distribution while GAL methods prefer known classes. While data is not extreme class-imbalanced, Random is more likely to sample the novel-class nodes compared to GAL; (2) OpenWGL boosts the performance of AGE on all datasets except for Products, and AGE+GPN is consistently inferior to AGE. This indicates that combining GAL with open-world learning is promising yet sub-optimal, as they do not amplify the feature discrepancy between known and unknown classes; (3) The proposed OWGAL achieves the best performance over all datasets under different evaluation metrics, showing the advantage of our strategies.

5.2.2 Performance under Different Budgets.

Budget is gauged at the unit of a labeling sample. To evaluate the performance of newly labeled nodes of each active learning round, we record the performance with the increment of budgets. Hence the performance increment between the \(t\)th and the \((t+1)\)th round reflects the quality of the newly labeled nodes of the \((t+1)\)th labeling round. We mainly report the results on Amazon-Computer and Arxiv, and similar results can be observed on the other datasets. As shown in Figure 5 and Figure 6, OWGAL is significantly superior to baselines on both Micro-F1 and Macro-F1 over different budgets. For Amazon-Computer, compared to the performance of Random with 140 budgets, OWGAL nearly saves 60–70% of budgets and can significantly reduce the labeling cost. AGE+OpenWGL is the most competitive method to OWGAL as it is inferior to random sampling with a small budget on Amazon-Computer. Random strategy is consistently better than most of the other AL methods across different budgets in OWGAL setting since it considers data distribution. For Arxiv, with the increase of budgets, Random is gradually inferior to the other GAL methods and achieves nearly 69.47% Micro-F1 and 47.74% Macro-F1 on 10,000 budget. For a similar level of Micro-F1 and Macro-F1 scores, OWGAL nearly saves by 60–70% of budgets compared to the Random baseline. AGE+OpenWGL performs better than other baselines at a low level of budget but does not improve its F1 score at a higher budget. This is because that AGE+OpenWGL has a poor balance between the active learning task and the discovery of novel classes.

Fig. 5.

Fig. 6.

To demonstrate that OWGAL is more sensitive to unknown classes than baselines, we also show the number of observed classes over different labeling budgets in Figure 7. We can clearly see that OWGAL can detect the unknown class more efficiently, as it only needs a budget quantity 40 to find all classes on Amazon-Computer. Meanwhile, OWGAL can detect the whole 40 classes of Arxiv while no baseline can achieve this within the given budget. The results of AGE+OpenWGL and AGE+GPN indicate that combining AGE with existing open-world learning algorithms have limited enhancement in the model’s sensitivity to unknown classes.

Fig. 7.

Fig. 8.

5.2.3 Performance under Various Known Classes.

In general, given a constant number of classes, the greater the number of known classes, the easier it is to solve the OWGAL issue. To analyze the model’s sensitivity to the number of known classes, we change the number of known classes with a fixed budget 140 for Amazon-Computer and 600 for Arxiv, and the results are reported in Figure 8. All methods benefit from an increase in the number of known classes, and GAL methods eventually outperform Random as all classes are known, which is consistent with prior works. IGP-hard is very sensitive to the number of known classes as information gain prefers the nodes within known class.

5.2.4 Performance under Closed-World Setting.

Beyond the open-world setting, we further demonstrate the effectiveness of OWGAL under the closed-world setting where all classes are known and a total of \(20|C|\) budget is allocated. As shown in Table 3, most traditional GAL methods enjoy a better performance compared to that in open world (Table 2). OWGAL is constantly high across all circumstances, demonstrating that our framework well adapts to the closed-world setting.

Table 3.

Datasets	Metric	Random	AGE	ANRMAB	GPA	ALG	IGP-hard	\(\mathrm{\mathbf {AGE}_{\mathbf {OpenWGL}}}\)	\(\mathrm{\mathbf {AGE}_{\mathbf {GPN}}}\)	OWGAL
Amazon-Computer	Micro-F1	84.13 \(\pm\) 1.53%	85.21 \(\pm\) 2.02%	84.91 \(\pm\) 1.73%	82.88 \(\pm\) 1.38%	86.32 \(\pm\) 1.94%	82.64 \(\pm\) 2.26%	86.02 \(\pm\) 0.53%	82.06 \(\pm\) 2.58%	87.24 \(\pm\) 1.05%
Amazon-Computer	Macro-F1	84.48 \(\pm\) 2.48%	83.25 \(\pm\) 2.88%	82.73 \(\pm\) 2.39%	79.73 \(\pm\) 3.67%	85.03 \(\pm\) 5.23%	82.07 \(\pm\) 1.34%	85.07 \(\pm\) 0.87%	82.98 \(\pm\) 3.68%	86.32 \(\pm\) 2.21%
Arxiv	Micro-F1	62.88 \(\pm\) 0.61%	63.18 \(\pm\) 0.22%	62.92 \(\pm\) 0.48%	63.39 \(\pm\) 0.23%	62.70 \(\pm\) 0.45%	61.55 \(\pm\) 2.87%	63.20 \(\pm\) 0.27%	62.34 \(\pm\) 0.35%	64.07 \(\pm\) 0.58%
Arxiv	Macro-F1	42.74 \(\pm\) 1.03%	43.24 \(\pm\) 1.27%	43.34 \(\pm\) 2.08%	41.17 \(\pm\) 0.73%	40.65 \(\pm\) 1.17%	40.48 \(\pm\) 2.53%	43.53 \(\pm\) 0.23%	42.99 \(\pm\) 0.61%	45.00 \(\pm\) 0.54%

Table 3. Comparison Results of Node Classification in the Closed-World Setting with Budgets \(20|C|\)

In each column, the best results are highlighted in bold, and the second best is underlined.

Fig. 9.

5.3 Efficiency Analysis

To analyze the efficiency and scalability of the proposed method, we measured the time cost on Amazon-Computer and Reddit. The number of initial known classes is 3 and 16 for Amazon-Computer and Reddit, respectively. All experiments are done on an NVIDIA GTX 2080 GPU and the model is trained for the number of epochs recorded in their implementations with a budget of \(20C\). In Figure 9, we observe that the training time of OWGAL is 15x faster than IGP and 5x faster than AGE on Reddit, which is consistent with our complexity analysis. Random is the most efficient baseline but has an inferior accuracy performance to our method. AGE incorporating with GPN and OpenWGAL can better capture the novel class nodes but introduces additional time consuming. Moreover, since GPA requires pre-training reinforcement models on other datasets, the complexity of reinforcement learning makes it the most time-consuming on all datasets.

5.4 Ablation Study

5.4.1 Prototype Learning.

We introduce prototype learning in OWGAL to make known and unknown classes more separable in the metric space, as well as to handle insufficient training data of novel classes. To see whether the prototype method works, we replace the prototype learning with the general GNN classifier, i.e., GNNs followed by a fully-connected layer, where the output dimension is equal to \(|C_n|\) and other parts of OWGAL are not changed. As shown in Figure 10, prototype learning achieves better performance than the GNN classifier on both Micro-F1 and Macro-F1 metrics. It suggests that the prototype method better figures out unknown classes by adjusting the geometric distances between classes. And the problem of insufficient training data on novel-classes is also alleviated.

Fig. 10.

5.4.2 Prototype Learning under Different Metrics.

In prototype learning, we introduce two distance metrics to enhance the distinguishability of unknown-class nodes from known-class nodes. To evaluate the effectiveness of Euclidean distance and Cosine distance in prototype learning, we present the comparison results in Table 4. Cosine distance is superior to Euclidean distance on most datasets, while Euclidean distance has better performance on Coauthor-Physics and Product. Overall, the combination of the two metrics leads to a better performance.

Table 4.

Cosine	Euclidean	Metric	Photo	Computer	CS	Physics	Arxiv	Reddit	Product
\(\checkmark\)		Micro-F1	92.79%	86.03%	91.12%	92.22%	62.73%	89.14%	65.67%
\(\checkmark\)		Macro-F1	91.47%	80.63%	88.22%	88.73%	37.00%	80.75%	26.68%
	\(\checkmark\)	Micro-F1	92.04%	85.67%	90.90%	93.05%	62.58%	88.01%	66.06%
	\(\checkmark\)	Macro-F1	89.80%	75.87%	88.15%	90.09%	37.63%	78.78%	25.66%
\(\checkmark\)	\(\checkmark\)	Micro-F1	93.09%	86.39%	91.42%	93.11%	63.32%	88.41%	66.46%
\(\checkmark\)	\(\checkmark\)	Macro-F1	91.71%	80.79%	88.77%	90.36%	38.23%	80.12%	27.15%

Table 4. Comparison Results of Prototype Learning with Different Distance Metrics

5.4.3 Strategy of Node Selection.

OWGAL utilizes weighted sampling to simultaneously consider the node and class characteristics. To see whether the part contributes to the overall performance, we replace the weighted sampling strategy with two other variants: unbiased random sampling (Random) and top-rated \(k\) nodes selection (Topk), while the former randomly annotates nodes and the latter annotates nodes with the highest importance score according to Equation (10). As shown in Table 1, Random and Topk strategies have inconsistent performance across different datasets. And random sampling strategy is typically more effective than top-\(k\) choice, as “Random” have higher Macro-F1 scores than “Topk” on all datasets except for A-Computer. This observation also suggests that the OWGAL case is very different from standard GAL, and focusing solely on these anomalous nodes is no longer the best strategy. And our method is superior to these two variants as it achieves a balance between the data distribution and node importance. As the importance of class is taken into account, we avoid overly favoring the unimportant class judging by the uncertainty scores alone. Notably, the performance of Random variant is superior to the Random baseline in Table 2. The variant is based on prototype learning network whereas the Random baseline adopts a GNN classifier. This also demonstrates the advantage of using prototype learning.

Fig. 11.

Fig. 12.

5.4.4 Hyperparameter Analysis.

OWGAL have three key hyperparameters: \(\lambda\), \(\gamma\) and \(\tau\). By changing their values, we observe different model performances. As depicted in Figure 11, we find that: (1) \(\lambda\) controls the geometric distance adjusting between prototypes and have varied impact on different datasets. For Arxiv, the larger \(\lambda\) provides better performance while the opposite holds on Reddit. Hence an appropriate choice of \(\lambda\) further improves the performance of OWGAL. (2) \(\gamma\) balances the prediction uncertainty in metric and topology spaces, and a larger \(\gamma\) provides better performance on Arxiv but it is not true on Reddit. This may be because Reddit has a much more complex topology structure, and topology information is helpful to finding the critical nodes; (3) \(\tau\) is also an important hyperparameter which contributes most to the performance fluctuation among the three. It tunes the weights between data distribution and node importance. The smaller \(\tau\) creates a more skewed distribution, emphasizing nodes with higher node scores. This prevents OWGAL from being dominated by data distribution and hence achieving a greater balance.

Except for the aforementioned ones, other hyperparameters also have impacts to the performance, e.g., hidden dimension, number of GNN layers and length of LP. To analyze the model’s sensitivity to these hyperparameters, we depict the results in Figure 12. It is observed that (1) A larger hidden dimension leads to a better performance as the model is more powerful in representation, but an overly large dimension would lead to overfitting; (2) Deeper GNNs do not yield a better performance due to over-smoothing, and the optimal depth is two; (3) The longer the LP, the better the model performance is, and the result is consistent with our findings that novel class nodes are often topologically far away from labeled ones. However, a longer LP would incur additional time consumption, and we choose three hops to achieve a decent performance.

Fig. 13.

Fig. 14.

5.5 Case Study

To explore the effects of prototype learning, we visualize the heat map of the average distances from unknown-class nodes to known ones on the GNN classifier and OWGAL in Figure 13. Specifically, we utilize GNN classifier and OWGAL to learn the node representations, respectively, where only partial nodes from class 0 to 2 are labeled. We compute the cluster center by averaging node representations of each known class. Finally, we report the average Euclidean distance between nodes and known classes’ center for each class. We observe that prototype learning effectively reduces the diameter of the known-class cluster, creating a more compact representation than GNN classifiers. For the GNN classifier, node representations of novel classes (3–9) are very close to one of the known classes and far away from others, conforming to our observation in Figure 2. In contrast, the nodes of novel classes in OWGAL are well separated from the known-class nodes, demonstrating the power of prototype learning in boosting the uncertainty score of novel-class nodes.

Table 5.

Datasets	Metric	Random	Topk	Ours
Amazon-Photo	Micro-F1	91.91%	87.40%	93.09%
Amazon-Photo	Macro-F1	89.76%	84.25%	91.71%
Amazon-Computer	Micro-F1	85.72%	84.59%	86.39%
Amazon-Computer	Macro-F1	75.07%	76.41%	80.79%
Coauthor-CS	Micro-F1	88.11%	87.06%	91.42%
Coauthor-CS	Macro-F1	79.00%	85.74%	88.77%
Coauthor-Physics	Micro-F1	92.34%	92.15%	93.11%
Coauthor-Physics	Macro-F1	89.03%	89.26%	90.36%
Arxiv	Micro-F1	61.46%	62.13%	63.32%
Arxiv	Macro-F1	34.24%	32.97%	38.23%
Reddit	Micro-F1	86.63%	81.02%	88.41%
Reddit	Macro-F1	75.67%	66.83%	80.12%
Product	Micro-F1	64.48%	62.31%	66.46%
Product	Macro-F1	24.25%	23.31%	27.15%

Table 5. Ablation Study of Different Node Selection Strategies

We further investigate the evolution of labeled and unlabeled nodes to show the power of our node selection strategy, see Table 5. UMAP [12] method is applied to depict the node embeddings and prototype representations on Amazon-Computer, as shown in Figure 14, where blue points denote the labeled nodes and green triangles mean the selected unlabeled nodes. Pentagrams of different colors denote different prototypes. Since OWGAL deals with the geometric distance, the prototypes of known classes are far apart from each other throughout training, leaving most unknown-class nodes clustered, as shown in Epoch 1 (top-left figure). In that case, OWGAL prefers to select nodes located in the central region, e.g., purple nodes, as these nodes have higher entropy scores. With the progress of training, the labeling choice of OWGAL gradually moves away from the central region to nodes near the decision boundary. For example, since all classes are discovered in Epoch 14, the green triangles are all located on the boundaries across different classes, representing the most difficult instances for the current models to figure out. This shows that OWGAL is remarkably adaptable to the evolving challenges in GAL.

6 Conclusion

As far as we know, the proposed framework OWGAL is among the first to tackle the learning challenge on evolving graphs with insufficient labeled data and known classes. Featured by the prototype learning, LP, and weighted sampling, OWGAL presents an integrated framework which excels in novel-class and most-valuable node detection at the same time. Extensive experiments on four real-world large-scale graphs demonstrates the superior performance of OWGAL compared to the state-of-the-art.

References

[1]

Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2017. Active learning for graph embedding. arXiv:1705.05085. Retrieved from https://arxiv.org/abs/1705.05085

Abstract

1 Introduction

2 Related Work

2.1 Open-World Learning

2.2 Graph Active Learning

3 Preliminaries

3.1 Problem Definition

3.2 Prototype Representation

4 Method

4.1 Metric-Based Prototype Learning

4.2 Label Propagation

4.3 Node Selection and Model Training

4.4 Complexity Analysis

5 Experiment

5.1 Setup

5.1.1 Datasets.

5.1.2 Baselines.

5.1.3 Experiment Setting.

5.2 Performance Comparison

5.2.1 End-to-End Comparison.

5.2.2 Performance under Different Budgets.

5.2.3 Performance under Various Known Classes.

5.2.4 Performance under Closed-World Setting.

5.3 Efficiency Analysis

5.4 Ablation Study

5.4.1 Prototype Learning.

5.4.2 Prototype Learning under Different Metrics.

5.4.3 Strategy of Node Selection.

5.4.4 Hyperparameter Analysis.

5.5 Case Study

6 Conclusion

References

Cited By

Index Terms

Recommendations

OpenWGL: open-world graph learning for unseen class node classification

Robust Graph Meta-Learning for Weakly Supervised Few-Shot Node Classification

Zero-shot classification with unseen prototype learning

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations