
Knowledge Distillation with Attention for Deep Transfer Learning of Convolutional Networks

Published: 22 October 2021

Abstract

Transfer learning by fine-tuning a neural network pre-trained on an extremely large dataset, such as ImageNet, can significantly improve and accelerate training, yet accuracy is frequently bottlenecked by the limited size of the new target task's dataset. To address this problem, regularization methods that constrain the outer-layer weights of the target network using the starting point as a reference (SPAR) have been studied. In this article, we propose a novel regularized transfer learning framework, \(\operatorname{DELTA}\) (DEep Learning Transfer using Feature Map with Attention). Instead of constraining the weights of the neural network, \(\operatorname{DELTA}\) aims to preserve the outer-layer outputs of the source network. Specifically, in addition to minimizing the empirical loss, \(\operatorname{DELTA}\) aligns the outer-layer outputs of the two networks by constraining a subset of feature maps, precisely selected by an attention model learned in a supervised manner. We evaluate \(\operatorname{DELTA}\) against state-of-the-art algorithms, including \(L^2\) and \(L^2\text{-}SP\). The experimental results show that our method outperforms these baselines with higher accuracy on new tasks. Code has been made publicly available.
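To make the objective concrete, below is a minimal PyTorch sketch of an attention-weighted feature-map regularizer in the spirit described above. It is an illustration under stated assumptions, not the authors' released implementation: the layer selection, the procedure that learns the attention weights, and names such as `delta_regularizer` and `beta` are hypothetical.

```python
# Minimal sketch of an attention-weighted feature-map regularizer in the
# spirit of DELTA. Illustrative only: the layer choice, how the attention
# weights are obtained, and the names below are assumptions, not the
# paper's reference code.
import torch
import torch.nn.functional as F


def delta_regularizer(source_maps, target_maps, attention_weights):
    """Attention-weighted squared distance between corresponding feature maps.

    source_maps, target_maps: lists of (N, C, H, W) tensors taken from the
        same outer layers of the frozen source network and of the network
        being fine-tuned, for the same mini-batch.
    attention_weights: list of (C,) tensors, one per layer, assumed to have
        been learned in a supervised manner beforehand.
    """
    reg = torch.zeros((), device=target_maps[0].device)
    for fm_src, fm_tgt, w in zip(source_maps, target_maps, attention_weights):
        # Squared distance per channel, then weight channels by attention.
        per_channel = (fm_tgt - fm_src.detach()).pow(2).sum(dim=(2, 3))  # (N, C)
        reg = reg + (w * per_channel).sum(dim=1).mean()
    return reg


def fine_tuning_loss(logits, labels, source_maps, target_maps,
                     attention_weights, beta=0.01):
    # Empirical loss plus the behaviour-preserving regularizer; `beta`
    # trades off fitting the target task against staying close to the
    # source network's outer-layer outputs.
    return F.cross_entropy(logits, labels) + beta * delta_regularizer(
        source_maps, target_maps, attention_weights)
```

In this reading, the source network is kept frozen as a reference, and the attention weights decide which channels of each outer layer are worth preserving during fine-tuning.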



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 3
June 2022, 494 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3485152

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2021
Accepted: 01 July 2021
Received: 01 February 2021
Published in TKDD Volume 16, Issue 3


Author Tags

  1. Transfer learning
  2. framework
  3. algorithms
  4. knowledge distillation

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Key Research and Development Program of China
  • Science and Technology Development Fund of Macau SAR
  • GuangDong Basic and Applied Basic Research Foundation
  • Key-Area Research and Development Program of Guangdong Province

