DOI: 10.1145/3532213.3532240
Research Article

Distil Knowledge from Natural Language

Published: 13 July 2022

Abstract

Knowledge Distillation (KD) is a machine learning approach to model compression and acceleration that suits applications with limited computational resources. KD is typically performed by distilling the knowledge of a large teacher model and transferring it to a smaller student model to enhance the student's performance. Current KD methods require supervision from a pretrained teacher model, whose training incurs additional computational cost. In this work, we first analyze the output of the teacher and then propose a method for distilling knowledge from natural language. On this basis, we propose Semantic-knowledge-based Teacher-free KD (ST-KD) to further advance model compression and acceleration. We evaluate our method on the CIFAR-10 and CIFAR-100 image classification datasets and show experimentally that it improves the performance of a variety of deep neural networks with virtually no additional computational cost.
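The abstract states the idea only at a high level. To make it concrete, here is a minimal sketch of how a semantic-knowledge-based, teacher-free KD loss could be assembled: soft targets are derived from similarities between class-name word embeddings (rather than from a pretrained teacher's outputs) and combined with the ordinary hard-label loss. This is an illustration under our own assumptions, not the paper's exact formulation; the helper names (semantic_soft_targets, st_kd_loss), the embedding source, and the temperature and alpha values are placeholders.

import torch
import torch.nn.functional as F

def semantic_soft_targets(class_embeddings, labels, temperature=4.0):
    # class_embeddings: (num_classes, dim) word vectors for the class names,
    # e.g. word2vec or fastText vectors (hypothetical input, not necessarily the paper's source).
    # labels: (batch,) ground-truth class indices.
    emb = F.normalize(class_embeddings, dim=1)      # unit-length class-name vectors
    sim = emb @ emb.t()                             # cosine similarity between class names
    soft = F.softmax(sim / temperature, dim=1)      # each row becomes a soft label distribution
    return soft[labels]                             # (batch, num_classes) soft targets

def st_kd_loss(student_logits, labels, soft_targets, temperature=4.0, alpha=0.5):
    # Hard-label cross-entropy plus KL divergence to the semantic soft targets,
    # in the style of standard KD but with no teacher network in the loop.
    ce = F.cross_entropy(student_logits, labels)
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_p, soft_targets, reduction="batchmean") * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd

Because the class-name embeddings are fixed, the soft targets can be precomputed once per class, so training cost stays essentially that of plain cross-entropy training, which is consistent with the abstract's claim of virtually no additional computational cost.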

Published In

ICCAI '22: Proceedings of the 8th International Conference on Computing and Artificial Intelligence
March 2022, 809 pages
ISBN: 9781450396110
DOI: 10.1145/3532213

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Image classification
2. Knowledge distillation
