
Knowledge Distillation with Attention for Deep Transfer Learning of Convolutional Networks

Published: 22 October 2021

Abstract

Transfer learning by fine-tuning a neural network pre-trained on an extremely large dataset, such as ImageNet, can significantly improve and accelerate training, yet accuracy is frequently bottlenecked by the limited size of the new target task's dataset. To address this problem, regularization methods that constrain the outer-layer weights of the target network using the starting point as a reference (SPAR) have been studied. In this article, we propose a novel regularized transfer learning framework, \(\operatorname{DELTA}\) (DEep Learning Transfer using Feature Map with Attention). Instead of constraining the weights of the neural network, \(\operatorname{DELTA}\) aims to preserve the outer-layer outputs of the source network. Specifically, in addition to minimizing the empirical loss, \(\operatorname{DELTA}\) aligns the outer-layer outputs of the two networks by constraining a subset of feature maps, precisely selected by an attention model learned in a supervised manner. We evaluate \(\operatorname{DELTA}\) against state-of-the-art algorithms, including \(L^2\) and \(L^2\text{-}SP\). The experimental results show that our method outperforms these baselines with higher accuracy on new tasks. Code has been made publicly available.
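To make the objective concrete, below is a minimal PyTorch sketch of an attention-weighted feature-map regularizer in the spirit described above. It is an illustration under stated assumptions, not the authors' released implementation: the layer selection, the procedure that learns the attention weights, and names such as `delta_regularizer` and `beta` are hypothetical.

```python
# Minimal sketch of an attention-weighted feature-map regularizer in the
# spirit of DELTA. Illustrative only: the layer choice, how the attention
# weights are obtained, and the names below are assumptions, not the
# paper's reference code.
import torch
import torch.nn.functional as F


def delta_regularizer(source_maps, target_maps, attention_weights):
    """Attention-weighted squared distance between corresponding feature maps.

    source_maps, target_maps: lists of (N, C, H, W) tensors taken from the
        same outer layers of the frozen source network and of the network
        being fine-tuned, for the same mini-batch.
    attention_weights: list of (C,) tensors, one per layer, assumed to have
        been learned in a supervised manner beforehand.
    """
    reg = torch.zeros((), device=target_maps[0].device)
    for fm_src, fm_tgt, w in zip(source_maps, target_maps, attention_weights):
        # Squared distance per channel, then weight channels by attention.
        per_channel = (fm_tgt - fm_src.detach()).pow(2).sum(dim=(2, 3))  # (N, C)
        reg = reg + (w * per_channel).sum(dim=1).mean()
    return reg


def fine_tuning_loss(logits, labels, source_maps, target_maps,
                     attention_weights, beta=0.01):
    # Empirical loss plus the behaviour-preserving regularizer; `beta`
    # trades off fitting the target task against staying close to the
    # source network's outer-layer outputs.
    return F.cross_entropy(logits, labels) + beta * delta_regularizer(
        source_maps, target_maps, attention_weights)
```

In this reading, the source network is kept frozen as a reference, and the attention weights decide which channels of each outer layer are worth preserving during fine-tuning.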



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 3
June 2022, 494 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3485152

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2021
Accepted: 01 July 2021
Received: 01 February 2021
Published in TKDD Volume 16, Issue 3


Author Tags

  1. Transfer learning
  2. framework
  3. algorithms
  4. knowledge distillation

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Key Research and Development Program of China
  • Science and Technology Development Fund of Macau SAR
  • GuangDong Basic and Applied Basic Research Foundation
  • Key-Area Research and Development Program of Guangdong Province

