How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation

Authors

  • Zixiang Ding, Meituan
  • Guoqing Jiang, Meituan
  • Shuai Zhang, Meituan
  • Lin Guo, Meituan
  • Wei Lin, Independent Researcher

DOI:

https://doi.org/10.1609/aaai.v38i16.29746

Keywords:

NLP: (Large) Language Models, CV: Learning & Optimization for CV

Abstract

We observe two phenomena with respect to teacher quantity and capacity: 1) more teachers are not always better for multi-teacher knowledge distillation, and 2) a stronger teacher is not always better for single-teacher knowledge distillation. To trade off the quantity and capacity of the teacher ensemble, in this paper we propose a new distillation paradigm named Dynamic Knowledge Distillation (DynaKD), which learns an adaptive categorical distribution to stochastically employ one teacher from a teacher ensemble at each step, transferring knowledge from the ensemble into the student. DynaKD has three advantages: 1) it preserves the diversity of each teacher via one-to-one rather than several-for-one distillation, 2) it makes the best of the powerful teacher via the multi-level assistant teachers in the ensemble, and 3) it dynamically determines the importance of each teacher for various tasks. To verify the effectiveness of the proposed approach, we conduct extensive experiments on BERT compression with the GLUE benchmark. Experimental results show that the proposed approach achieves state-of-the-art scores compared to previous compression approaches on five out of seven downstream tasks, including pushing MRPC F1 and accuracy to 92.2 (a 1.4-point absolute improvement) and RTE accuracy to 76.2 (a 2.8-point absolute improvement). Moreover, we also conduct extensive experiments on image classification with CIFAR-100, where DynaKD likewise achieves state-of-the-art performance.
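The core idea in the abstract, sampling a single teacher per step from a learned categorical distribution and distilling from it alone, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: all names (`DynaKDSelector`, `distill_step`) are hypothetical, the distribution is shown as plain learnable logits, and the paper's actual mechanism for adapting the distribution may differ.

```python
# Minimal sketch (assumed, not the authors' code): each training step samples
# one teacher from a categorical distribution over the ensemble and distills
# the student from that single teacher.
import torch
import torch.nn.functional as F


class DynaKDSelector(torch.nn.Module):
    def __init__(self, num_teachers: int):
        super().__init__()
        # Learnable logits parameterizing the categorical distribution over teachers.
        self.logits = torch.nn.Parameter(torch.zeros(num_teachers))

    def sample(self) -> int:
        # Draw one teacher index per step from Categorical(softmax(logits)).
        probs = F.softmax(self.logits, dim=-1)
        return int(torch.distributions.Categorical(probs).sample())


def distill_step(student, teachers, selector, batch, temperature=2.0):
    """One training step: distill the student from a single sampled teacher."""
    idx = selector.sample()                    # stochastically employ a teacher
    with torch.no_grad():
        teacher_logits = teachers[idx](batch)  # frozen teacher forward pass
    student_logits = student(batch)
    # Standard temperature-scaled KL distillation loss (one-to-one, so each
    # teacher's output distribution is preserved rather than averaged).
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```

Note that discrete sampling is not differentiable with respect to the selector's logits, so this sketch omits how the categorical distribution itself is adapted during training; the paper presumably handles that with its own learning procedure.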

Published

2024-03-24

How to Cite

Ding, Z., Jiang, G., Zhang, S., Guo, L., & Lin, W. (2024). How to Trade Off the Quantity and Capacity of Teacher Ensemble: Learning Categorical Distribution to Stochastically Employ a Teacher for Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17915-17923. https://doi.org/10.1609/aaai.v38i16.29746

Section

AAAI Technical Track on Natural Language Processing I