ISCA Archive Interspeech 2022

Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation

Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

Although deep learning-based audio-visual speech recognition (AVSR) systems recognize base closed-set categories well, extending their discerning ability to additional novel categories with limited labeled training data is challenging since the model easily overfits. In this paper, we propose Prototype-based Co-Adaptation with Transformer (Proto-CAT), a multi-modal generalized few-shot learning (GFSL) method for AVSR systems. In other words, Proto-CAT learns to recognize novel-class multi-modal objects from few-shot training data, while maintaining its performance on the base closed-set categories. The main idea is to transform the prototypes (i.e., class centers) by incorporating cross-modality complementary information and calibrating cross-category semantic differences. In particular, Proto-CAT co-adapts the embeddings from audio-visual and category levels, so that it generalizes its predictions on all categories dynamically. Proto-CAT achieves state-of-the-art performance on various AVSR-GFSL benchmarks. The code is available at https://github.com/ZhangYikaii/Proto-CAT.
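The following is a minimal sketch of the prototype-based co-adaptation idea described in the abstract: class prototypes are computed per modality as support-set class means, a transformer lets them exchange cross-modality and cross-category information, and queries are classified by distance to the adapted prototypes. It is not the authors' implementation (see the linked repository for Proto-CAT); the module name, dimensions, fusion by averaging, and the single-layer transformer are illustrative assumptions.

```python
# Illustrative sketch only; not the Proto-CAT implementation from the paper.
import torch
import torch.nn as nn


class ProtoCoAdaptSketch(nn.Module):
    def __init__(self, dim=128, n_heads=4):
        super().__init__()
        # A transformer encoder lets prototypes attend to each other across
        # both modalities and categories ("co-adaptation"); one layer here
        # is an assumption for brevity.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.adapt = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, audio_support, visual_support, support_labels,
                audio_query, visual_query, n_classes):
        # audio_support / visual_support: (n_support, dim) embeddings
        # support_labels: (n_support,) integer class ids in [0, n_classes)
        # *_query: (n_query, dim) embeddings
        protos = []
        for m in (audio_support, visual_support):
            # Class prototypes = per-class mean of support embeddings.
            p = torch.stack([m[support_labels == c].mean(0)
                             for c in range(n_classes)])
            protos.append(p)
        # Stack prototypes from both modalities into one token sequence so
        # the transformer can transform them jointly, then average the
        # adapted modality-specific prototypes per class.
        tokens = torch.cat(protos, dim=0).unsqueeze(0)      # (1, 2*C, dim)
        adapted = self.adapt(tokens).squeeze(0)
        adapted = adapted.view(2, n_classes, -1).mean(0)    # (C, dim)
        # Classify queries by negative distance to the adapted prototypes,
        # fusing audio and visual query embeddings by simple averaging
        # (an assumption; the paper's fusion may differ).
        query = (audio_query + visual_query) / 2
        logits = -torch.cdist(query, adapted)
        return logits
```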


doi: 10.21437/Interspeech.2022-652

Cite as: Zhang, Y.-K., Zhou, D.-W., Ye, H.-J., Zhan, D.-C. (2022) Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation. Proc. Interspeech 2022, 531-535, doi: 10.21437/Interspeech.2022-652

@inproceedings{zhang22k_interspeech,
  author={Yi-Kai Zhang and Da-Wei Zhou and Han-Jia Ye and De-Chuan Zhan},
  title={{Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={531--535},
  doi={10.21437/Interspeech.2022-652}
}