DOI: 10.1145/3532213.3532240
Research Article

Distil Knowledge from Natural Language

Published: 13 July 2022

Abstract

Knowledge Distillation (KD) is a machine learning approach to model compression and acceleration that suits applications with limited computational resources. KD is typically performed by distilling the knowledge of a large teacher model and transferring it to a smaller student model to enhance the student's performance. Current KD methods require supervision from a pretrained teacher model, whose training incurs additional computational cost. In this work, we first analyze the output of the teacher and then propose a method for distilling knowledge from natural language. On this basis, we propose Semantic-knowledge-based Teacher-free KD (ST-KD) to further advance model compression and acceleration. We evaluate our method on the CIFAR-10 and CIFAR-100 image classification datasets and show experimentally that it improves the performance of a variety of deep neural networks with virtually no additional computational cost.
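The abstract states the idea only at a high level. To make it concrete, here is a minimal sketch of how a semantic-knowledge-based, teacher-free KD loss could be assembled: soft targets are derived from similarities between class-name word embeddings (rather than from a pretrained teacher's outputs) and combined with the ordinary hard-label loss. This is an illustration under our own assumptions, not the paper's exact formulation; the helper names (semantic_soft_targets, st_kd_loss), the embedding source, and the temperature and alpha values are placeholders.

import torch
import torch.nn.functional as F

def semantic_soft_targets(class_embeddings, labels, temperature=4.0):
    # class_embeddings: (num_classes, dim) word vectors for the class names,
    # e.g. word2vec or fastText vectors (hypothetical input, not necessarily the paper's source).
    # labels: (batch,) ground-truth class indices.
    emb = F.normalize(class_embeddings, dim=1)      # unit-length class-name vectors
    sim = emb @ emb.t()                             # cosine similarity between class names
    soft = F.softmax(sim / temperature, dim=1)      # each row becomes a soft label distribution
    return soft[labels]                             # (batch, num_classes) soft targets

def st_kd_loss(student_logits, labels, soft_targets, temperature=4.0, alpha=0.5):
    # Hard-label cross-entropy plus KL divergence to the semantic soft targets,
    # in the style of standard KD but with no teacher network in the loop.
    ce = F.cross_entropy(student_logits, labels)
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_p, soft_targets, reduction="batchmean") * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd

Because the class-name embeddings are fixed, the soft targets can be precomputed once per class, so training cost stays essentially that of plain cross-entropy training, which is consistent with the abstract's claim of virtually no additional computational cost.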

Published In

ICCAI '22: Proceedings of the 8th International Conference on Computing and Artificial Intelligence
March 2022, 809 pages
ISBN: 9781450396110
DOI: 10.1145/3532213

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Image classification
2. Knowledge distillation
