Abstract
How can we learn effective node representations on textual graphs? Graph Neural Networks (GNNs) that use Language Models (LMs) to encode the textual information of graphs achieve state-of-the-art performance in many node classification tasks. Yet, combining GNNs with LMs has not been widely explored in practical deployments due to scalability issues. In this work, we tackle this challenge by developing a Graph-Aware Distillation framework (GraD) that encodes graph structure into an LM for graph-free, fast inference. Unlike conventional knowledge distillation, GraD jointly optimizes a GNN teacher and a graph-free student over the graph's nodes via a shared LM. This encourages the graph-free student to exploit graph information encoded by the GNN teacher while, at the same time, enabling the GNN teacher to better leverage textual information from unlabeled nodes. As a result, the teacher and the student models learn from each other and improve their overall performance. Experiments on eight node classification benchmarks, in both transductive and inductive settings, showcase GraD's superiority over existing distillation approaches for textual graphs. Our code and supplementary material are available at: https://github.com/cmavro/GRAD.
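As a rough illustration of the joint objective described above, the following is a minimal PyTorch sketch under toy assumptions: the shared LM is replaced by a bag-of-words encoder, the GNN teacher by a single mean-aggregation layer, and the names TextEncoder, GNNTeacher, GraphFreeStudent, and grad_step are hypothetical, not the authors' implementation.

```python
# Toy sketch of GraD-style joint teacher/student training over a shared encoder.
# `node_tokens` are integer token ids per node, `adj` is a dense adjacency matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Stand-in for the shared LM: averages token embeddings per node."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
    def forward(self, node_tokens):           # (num_nodes, seq_len) int ids
        return self.emb(node_tokens)           # (num_nodes, dim)

class GNNTeacher(nn.Module):
    """One-layer mean-aggregation GNN head on top of the shared encoder."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(dim, num_classes)
    def forward(self, h, adj):                 # adj: (num_nodes, num_nodes)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.lin(adj @ h / deg)         # aggregate neighbor text features

class GraphFreeStudent(nn.Module):
    """Graph-free head: classifies each node from its own text feature only."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(dim, num_classes)
    def forward(self, h):
        return self.lin(h)

def grad_step(encoder, teacher, student, opt, node_tokens, adj, labels,
              labeled_mask, alpha=0.5, tau=1.0):
    """One joint update: both heads share the encoder, so gradients from the
    GNN teacher and the graph-free student both shape the LM representation."""
    opt.zero_grad()
    h = encoder(node_tokens)
    t_logits = teacher(h, adj)
    s_logits = student(h)
    # Supervised losses on labeled nodes for both heads.
    sup = F.cross_entropy(t_logits[labeled_mask], labels[labeled_mask]) \
        + F.cross_entropy(s_logits[labeled_mask], labels[labeled_mask])
    # Distillation: the student matches the teacher's soft predictions on all nodes.
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                  F.softmax(t_logits.detach() / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    loss = sup + alpha * kd
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch, a single optimizer over the encoder and both heads (e.g., torch.optim.Adam over encoder, teacher, and student parameters) realizes the joint optimization: gradients from the GNN teacher and the graph-free student both update the shared text encoder, and at inference time only the encoder and student are needed, without graph access.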
C. Mavromatis—Work done while interning at Amazon Web Services, Santa Clara.
Notes
1. For example, the inference cost of a single transformer layer is \({\mathcal {O}}(L^2d + Ld^2)\), where L is the sequence length and d is the number of hidden dimensions.
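As a back-of-the-envelope illustration of this note (the concrete numbers below are assumptions, not values from the paper), the two terms can be compared directly:

```python
# Rough O(L^2 d + L d^2) operation count for one transformer layer:
# the L^2 d term (self-attention) dominates for long sequences,
# the L d^2 term (projections/feed-forward) dominates for wide models.
def transformer_layer_cost(L: int, d: int) -> int:
    attention = L * L * d      # pairwise token interactions
    projections = L * d * d    # per-token linear maps
    return attention + projections

print(transformer_layer_cost(L=512, d=768))  # ~5e8 ops, attention-heavy
print(transformer_layer_cost(L=128, d=768))  # ~9e7 ops, projection-heavy
```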
Acknowledgment
Part of this work was supported by NSF (1704074, 1757916, 1834251, 1834332). Access to research and computing facilities was provided by the College of Science & Engineering and the Minnesota Supercomputing Institute.
Ethics declarations
Limitations and Ethical Statement
GraD relies on informative input node features, as is the case in textual graphs, to learn effective shared LMs (or MLPs) that generalize to unseen nodes. One limitation is therefore that it is unclear how well GraD generalizes to other kinds of graphs, e.g., featureless graphs. Moreover, as a knowledge distillation approach, GraD trades accuracy for computational efficiency and cannot adapt to dynamic graphs with edge changes the way a GNN could. To overcome biases encoded in the training graph, e.g., stereotypes present in recommendation graphs, GraD needs to be retrained on a new, unbiased graph.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mavromatis, C., et al. (2023). Train Your Own GNN Teacher: Graph-Aware Distillation on Textual Graphs. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14171. Springer, Cham. https://doi.org/10.1007/978-3-031-43418-1_10
DOI: https://doi.org/10.1007/978-3-031-43418-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43417-4
Online ISBN: 978-3-031-43418-1