Abstract:
Knowledge distillation has proven to be an effective model compression method that exploits knowledge from a teacher model to supervise a student model by minimizing the distribution difference between the teacher's knowledge and the student's predictions. In this work, we focus on improving its performance from the perspective of divergence measures. A general form representing a family of divergence measures is introduced, and a novel learning paradigm that jointly optimizes multiple measures is proposed by formalizing it as a multi-objective learning problem. Conditioned on Pareto optimality, the weights of the different divergences are tuned automatically during training. Extensive experiments on multiple datasets show that the proposed method significantly improves the performance of student networks compared to state-of-the-art knowledge distillation methods. Code is available at https://github.com/CML00/MoDiv.
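To make the idea of distilling with a weighted combination of divergence measures concrete, the sketch below shows a generic multi-divergence distillation loss in PyTorch. It is not the authors' MoDiv implementation: the choice of divergences (KL and Jensen-Shannon), the temperature `T`, and the fixed `weights` argument are illustrative assumptions, whereas the paper tunes the divergence weights automatically under a Pareto-optimality condition during training.

```python
import torch
import torch.nn.functional as F

def kl_div(student_log, teacher_prob):
    # KL(teacher || student), with student log-probs and teacher probs.
    return F.kl_div(student_log, teacher_prob, reduction="batchmean")

def js_div(student_log, teacher_prob):
    # Jensen-Shannon divergence between the two distributions.
    student_prob = student_log.exp()
    m = 0.5 * (student_prob + teacher_prob)
    return 0.5 * F.kl_div(m.log(), student_prob, reduction="batchmean") \
         + 0.5 * F.kl_div(m.log(), teacher_prob, reduction="batchmean")

def multi_divergence_kd_loss(student_logits, teacher_logits, weights, T=4.0):
    """Weighted sum of several divergence measures between the softened
    teacher and student output distributions. `weights` is a placeholder
    for the Pareto-conditioned coefficients tuned during training."""
    s_log = F.log_softmax(student_logits / T, dim=1)
    t_prob = F.softmax(teacher_logits / T, dim=1)
    divergences = [kl_div(s_log, t_prob), js_div(s_log, t_prob)]
    # T^2 scaling keeps gradient magnitudes comparable across temperatures,
    # as is standard in knowledge distillation.
    return (T * T) * sum(w * d for w, d in zip(weights, divergences))
```

In this sketch the weights are supplied by the caller; the paper's contribution is to treat the divergences as multiple objectives and adjust these weights automatically during training.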
Published in: IEEE Signal Processing Letters (Volume: 28)