DOI: 10.1145/3400286.3418273

Channel-Wise Attention and Channel Combination for Knowledge Distillation

Published: 25 November 2020

Abstract

Knowledge distillation is a strategy for building machine learning models efficiently by making use of the knowledge embedded in a pretrained model. The teacher-student framework is a well-known way to apply knowledge distillation: a teacher network contains the knowledge for a specific task, and a student network with a simpler architecture is built to inherit that knowledge. This paper proposes a new approach that uses an attention mechanism to extract knowledge from a teacher network. The attention function determines which channels of the teacher network's feature maps are used to train the student network, so that the student learns only useful features. This approach allows a new model to learn useful features while taking model complexity into account.
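
The abstract does not give the exact formulation, so the following is only a minimal PyTorch sketch of the general idea, under assumptions of our own: an attention score is computed for each channel of the teacher's feature map and used to weight a per-channel distillation loss, steering the student toward the channels the attention marks as useful. The class name, the gating network, the 1x1 projection for mismatched channel counts, and the choice of a weighted per-channel MSE are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionDistillationLoss(nn.Module):
    """Sketch of a channel-wise attention loss for feature distillation.

    Attention scores are computed from the teacher's feature map and used
    to weight the per-channel distance between teacher and student
    features, so channels deemed unimportant contribute less to the
    distillation loss. All names here are illustrative assumptions.
    """

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # Project student features to the teacher's channel dimension.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        # Small gating network producing one attention score per teacher channel.
        self.gate = nn.Sequential(
            nn.Linear(teacher_channels, teacher_channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(teacher_channels // 4, teacher_channels),
        )

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # f_student: (B, Cs, H, W), f_teacher: (B, Ct, H, W)
        f_student = self.proj(f_student)                         # (B, Ct, H, W)
        pooled = F.adaptive_avg_pool2d(f_teacher, 1).flatten(1)  # (B, Ct)
        attn = torch.softmax(self.gate(pooled), dim=1)           # (B, Ct), sums to 1
        # Per-channel mean-squared error, weighted by the attention scores.
        per_channel_mse = (f_student - f_teacher).pow(2).mean(dim=(2, 3))  # (B, Ct)
        return (attn * per_channel_mse).sum(dim=1).mean()
```

In practice such a loss would typically be added to the student's usual task loss (e.g., cross-entropy) with a weighting coefficient, and the gating network would be trained jointly with the student.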



    Published In

    RACS '20: Proceedings of the International Conference on Research in Adaptive and Convergent Systems
    October 2020
    300 pages
    ISBN:9781450380256
    DOI:10.1145/3400286


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Attention
    2. Knowledge distillation
    3. Visual representation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    RACS '20

    Acceptance Rates

    RACS '20 paper acceptance rate: 42 of 148 submissions (28%)
    Overall acceptance rate: 393 of 1,581 submissions (25%)

    Article Metrics

    • Total citations: 0
    • Total downloads: 72
    • Downloads (last 12 months): 11
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 03 Mar 2025
