Knowledge Distillation with Attention for Deep Transfer Learning of Convolutional Networks

Transfer learning through fine-tuning a neural network pre-trained on an extremely large dataset, such as ImageNet, can significantly improve and accelerate training, but the accuracy is frequently bottlenecked by the limited dataset size of the new target task. To address this problem, regularization methods that constrain the outer layer weights of the target network, using the starting point as references (SPAR), have been studied. In this article, we propose a novel regularized transfer learning framework, \(\operatorname{DELTA}\), namely DEep Learning Transfer using Feature Map with Attention. Instead of constraining the weights of the neural network, \(\operatorname{DELTA}\) aims at preserving the outer layer outputs of the source network. Specifically, in addition to minimizing the empirical loss, \(\operatorname{DELTA}\) aligns the outer layer outputs of the two networks by constraining a subset of feature maps that are precisely selected by attention learned in a supervised manner. We evaluate \(\operatorname{DELTA}\) against state-of-the-art algorithms, including \(L^2\) and \(L^2\text{-}SP\). The experimental results show that our method outperforms these baselines, achieving higher accuracy on new tasks. Code has been made publicly available.
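
The contrast drawn in the abstract, between constraining weights toward the starting point and constraining attention-weighted feature maps, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the squared-distance form of both penalties, the per-channel attention weights, and the trade-off coefficient `alpha` are hypothetical and are not taken from the authors' released code.

```python
from typing import Sequence

Vector = Sequence[float]


def l2_sp_penalty(weights: Vector, start_weights: Vector) -> float:
    """Weight-space regularizer (assumed form): squared distance between
    the current weights and the pre-trained starting point."""
    return sum((w - w0) ** 2 for w, w0 in zip(weights, start_weights))


def attention_feature_penalty(
    target_maps: Sequence[Vector],
    source_maps: Sequence[Vector],
    attention: Vector,
) -> float:
    """Feature-space regularizer (assumed form): squared distance between
    each flattened feature map of the two networks, scaled by a per-channel
    attention weight. A weight of 0 leaves that channel unconstrained."""
    total = 0.0
    for a, t_map, s_map in zip(attention, target_maps, source_maps):
        channel_distance = sum((t - s) ** 2 for t, s in zip(t_map, s_map))
        total += a * channel_distance
    return total


def regularized_loss(
    empirical_loss: float,
    target_maps: Sequence[Vector],
    source_maps: Sequence[Vector],
    attention: Vector,
    alpha: float = 0.01,  # hypothetical trade-off coefficient
) -> float:
    """Empirical loss plus the attention-weighted feature-map penalty."""
    penalty = attention_feature_penalty(target_maps, source_maps, attention)
    return empirical_loss + alpha * penalty


# Toy example: two channels, each flattened to two activations.
target = [[1.0, 2.0], [0.0, 0.0]]
source = [[1.0, 0.0], [3.0, 4.0]]
attn = [0.5, 0.0]  # only the first channel is constrained
penalty = attention_feature_penalty(target, source, attn)
```

In this toy case the second channel differs greatly between the two networks but contributes nothing to the penalty because its attention weight is zero, which is the sense in which attention selects a subset of feature maps to preserve.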





Funding Sources

  • National Key Research and Development Program of China
  • Science and Technology Development Fund of Macau SAR
  • GuangDong Basic and Applied Basic Research Foundation
  • Key-Area Research and Development Program of Guangdong Province


